System and Method for Multi-Channel Noise Suppression Based on Closed-Form Solutions and Estimation of Time-Varying Complex Statistics

ABSTRACT

Multi-channel noise suppression systems and methods are described that omit the traditional delay-and-sum fixed beamformer in devices that include a primary speech microphone and at least one noise reference microphone with the desired speech being in the near-field of the device. The multi-channel noise suppression systems and methods use a blocking matrix (BM) to remove desired speech in the input speech signal received by the noise reference microphone to get a “cleaner” background noise component. Then, an adaptive noise canceler (ANC) is used to remove the background noise in the input speech signal received by the primary speech microphone based on the “cleaner” background noise component to achieve noise suppression. The filters implemented by the BM and ANC are derived using closed-form solutions that require calculation of time-varying statistics of complex frequency domain signals in the noise suppression system.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 61/413,231, filed on Nov. 12, 2010, which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

This application relates generally to systems that process audio signals, such as speech signals, to remove undesired noise components therefrom.

BACKGROUND

The term noise suppression generally describes a signal processing technique that attempts to attenuate or remove an undesired noise component from an input signal. Noise suppression may be applied to almost any type of input signal that may include an undesired/interfering component such as a noise component. For example, noise suppression functionality is often implemented in telecommunications devices, such as telephones, Bluetooth® headsets, or the like, to attenuate or remove an undesired background noise component from an input speech signal. In general, an input speech signal may be viewed as comprising both a desired speech component (sometimes referred to as “clean speech”) and a background noise component. Removing the background noise component from the input speech signal ideally leaves only the desired speech component as output.

In multi-microphone systems, noise suppression is often implemented based on the Generalized Sidelobe Canceler (GSC). The GSC consists of a fixed beamformer, a blocking matrix, and an adaptive noise canceler. In the most general case, the fixed beamformer functions to filter M input speech signals received from M microphones to create a so-called speech reference signal comprising a desired speech component and a background noise component. The blocking matrix creates M−1 background noise references by spatially suppressing the desired speech component in the M input speech signals. The adaptive noise canceler then estimates the background noise component in the speech reference signal produced by the fixed beamformer, based on the M−1 background noise references, and suppresses the estimated background noise component from the speech reference signal, thereby ideally leaving only the desired speech component as output. A code sketch of this signal flow appears below.
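For readers who prefer code, the generic GSC signal flow for a single frequency bin can be sketched as follows (an illustrative paraphrase only, not taken from this application; the array shapes, weight names, and the absence of any adaptation logic are all simplifying assumptions):

```python
import numpy as np

def gsc_frame(X: np.ndarray, w_fb: np.ndarray, B: np.ndarray,
              w_anc: np.ndarray) -> complex:
    """One GSC frame for one frequency bin.

    X     : (M,) complex microphone spectra.
    w_fb  : (M,) fixed beamformer weights (delay-and-sum style).
    B     : (M-1, M) blocking matrix producing M-1 noise references.
    w_anc : (M-1,) adaptive noise canceler weights.
    """
    speech_ref = np.vdot(w_fb, X)           # speech reference (w_fb^H X)
    noise_refs = B @ X                      # spatially suppress desired speech
    noise_est = np.vdot(w_anc, noise_refs)  # estimate noise in speech_ref
    return speech_ref - noise_est           # ideally, desired speech only
```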

However, in some multi-microphone systems, at least one microphone is dedicated as a noise reference microphone and at least one microphone is dedicated as a primary speech microphone. The noise reference microphone is positioned to be relatively far from a desired speech source during regular use of the multi-microphone system. In fact, the noise reference microphone can be positioned to be as far from the desired speech source as possible during regular use of the multi-microphone system. Therefore, the input speech signal received by the noise reference microphone often will have a very poor signal-to-noise ratio (SNR). The primary speech microphone, on the other hand, is positioned to be relatively close to the desired speech source during regular use and, as a result, usually receives an input speech signal that has a much better SNR compared to the input speech signal received by the noise reference microphone.

In these multi-microphone systems, with a dedicated noise reference microphone and primary speech microphone, the traditional delay-and-sum fixed beamformer structure of the GSC (described above) may not make much sense because it can result in a speech reference signal with an SNR that is worse than that of the unprocessed input speech signal received by the primary speech microphone. In general, it is possible to get constructive interference between the desired speech components of input speech signals received by multiple microphones using the traditional delay-and-sum fixed beamformer structure. However, in the case of a multi-microphone system with a noise reference microphone and a primary speech microphone as described above, the traditional delay-and-sum fixed beamformer structure is often unable to improve the SNR compared to the primary speech microphone because of the poor SNR of the input speech signal received by the noise reference microphone. Thus, using the traditional delay-and-sum fixed beamformer structure in such a multi-microphone system often will result in a speech reference signal that has a worse SNR than that of the input speech signal received by the primary speech microphone.

Moreover, adaptive algorithms (e.g., a least mean square adaptive algorithm) conventionally used to derive the filters for the blocking matrix and the adaptive noise canceler of the GSC are often slow to converge.

Therefore, what is needed is an approach to multi-channel noise suppression that does not rely on the traditional delay-and-sum fixed beamformer structure of the GSC and/or on slow-to-converge adaptive algorithms for deriving the filters used to suppress noise.

BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES

The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.

FIG. 1 illustrates a front view of an example wireless communication device in which embodiments of the present invention can be implemented.

FIG. 2 illustrates a back view of the example wireless communication device shown in FIG. 1.

FIG. 3 illustrates a block diagram of an example system for multi-channel noise suppression in accordance with an embodiment of the present invention.

FIG. 4 illustrates an example piecewise linear mapping from difference in energy between a primary input speech signal and a reference input speech signal to adaptation factor for the blocking matrix statistics in accordance with an embodiment of the present invention.

FIG. 5 illustrates a flowchart of a method for estimating time-varying statistics for a closed-form solution of a blocking matrix filter in accordance with an embodiment of the present invention.

FIG. 6 illustrates an example piecewise linear mapping from difference in energy between a primary input speech signal and a reference input speech signal to adaptation factor for estimating statistics of stationary noise in accordance with an embodiment of the present invention.

FIG. 7 illustrates a flowchart of a method for estimating time-varying stationary background noise statistics in accordance with an embodiment of the present invention.

FIG. 8 illustrates example piecewise linear mappings from difference in energy (or moving average of difference in energy) between a primary input speech signal and a “cleaner” background noise component to adaptation factor for estimating time-varying statistics for a closed-form solution of an ANC section in accordance with an embodiment of the present invention.

FIG. 9 illustrates a flowchart of a method for estimating the time-varying statistics of an adaptive noise canceler filter in accordance with an embodiment of the present invention.

FIG. 10 illustrates an exemplary variation of the multi-channel noise suppression system of FIG. 3 that further implements an automatic microphone calibration scheme in accordance with an embodiment of the present invention.

FIG. 11 illustrates a flowchart of a method for updating a current estimated value of a microphone sensitivity mismatch in accordance with an embodiment of the present invention.

FIG. 12 illustrates a block diagram of an example computer system that can be used to implement aspects of the present invention.

The present invention will be described with reference to the accompanying drawings. The drawing in which an element first appears is typically indicated by the leftmost digit(s) in the corresponding reference number.

DETAILED DESCRIPTION

1. Introduction

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention, including structures, systems, and methods, may be practiced without these specific details. The description and representation herein are the common means used by those experienced or skilled in the art to most effectively convey the substance of their work to others skilled in the art. In other instances, well-known methods, procedures, components, and circuitry have not been described in detail to avoid unnecessarily obscuring aspects of the invention.

References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

As noted in the background section above, certain multi-microphone systems include a primary speech microphone and a noise reference microphone. The primary speech microphone is positioned to be close to a desired speech source during regular use of the multi-microphone system, whereas the noise reference microphone is positioned to be farther from the desired speech source during regular use of the multi-microphone system. Therefore, the input speech signal received by the primary speech microphone typically will have a better SNR compared to the input speech signal received by the noise reference microphone. In these multi-microphone systems, if the SNR on the noise reference microphone is much worse than that on the primary speech microphone, then the use of a traditional delay-and-sum fixed beamformer structure to suppress background noise generally does not make much sense because it can result in a speech reference signal with an SNR that is worse than that of the unprocessed input speech signal received by the primary speech microphone.

The multi-channel noise suppression systems and methods described herein omit the traditional delay-and-sum fixed beamformer in devices that include a primary speech microphone and at least one noise reference microphone as noted above. The multi-channel noise suppression systems and methods use a blocking matrix (BM) to remove desired speech in the input speech signal received by the noise reference microphone to get a “cleaner” background noise component. Then, an adaptive noise canceler (ANC) is used to remove the background noise in the input speech signal received by the primary speech microphone based on the “cleaner” background noise component to achieve noise suppression.

In accordance with embodiments described herein, the filters implemented by the BM and ANC are derived using closed-form solutions that require calculation of time-varying statistics (for frequency domain implementations) of complex signals in the noise suppression system. Conventionally, adaptive algorithms that are potentially slow to converge have been used to derive such filters. Furthermore, in accordance with embodiments described herein, spatial information embedded in the input speech signals received by the primary speech microphone and the noise reference microphone is exploited to estimate the necessary time-varying statistics to perform closed-form calculations of the filters implemented by the BM and ANC.

It should be noted that, wherever a difference in energy between two signals is used to perform a function or determine a subsequent value as described below (where the difference in energy can be calculated, for example, by subtracting the log-energies of the two signals), a difference in level between the two signals (i.e., difference in signal level) can be used instead.
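By way of illustration, such a per-frame energy difference might be computed as follows (a minimal sketch; the function name, dB scaling, and epsilon guard are assumptions, not details of this disclosure):

```python
import numpy as np

def log_energy_difference_db(p_frame: np.ndarray, r_frame: np.ndarray) -> float:
    """Difference in log-energy (dB) between a primary-microphone frame
    and a reference-microphone frame; large positive values suggest that
    near-field desired speech dominates the frame."""
    eps = 1e-12  # guards against log of zero on silent frames
    e_p = 10.0 * np.log10(np.sum(np.abs(p_frame) ** 2) + eps)
    e_r = 10.0 * np.log10(np.sum(np.abs(r_frame) ** 2) + eps)
    return e_p - e_r
```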

2. System for Multi-Channel Noise Suppression

FIGS. 1 and 2 respectively illustrate a front portion 100 and a back portion 200 of an example wireless communication device 102 in which embodiments of the present invention can be implemented. Wireless communication device 102 can be a personal digital assistant (PDA), a cellular telephone, or a tablet computer, for example.

As shown in FIG. 1, front portion 100 of wireless communication device 102 includes a primary speech microphone 104 that is positioned to be close to a user's mouth during regular use of wireless communication device 102. Accordingly, primary speech microphone 104 is positioned to capture the user's speech (i.e., the desired speech). As shown in FIG. 2, a back portion 200 of wireless communication device 102 includes a noise reference microphone 106 that is positioned to be farther from the user's mouth during regular use than primary speech microphone 104. For instance, noise reference microphone 106 can be positioned as far from the user's mouth during regular use as possible.

Although the input speech signals received by primary speech microphone 104 and noise reference microphone 106 will each contain desired speech and background noise components, by positioning primary speech microphone 104 so that it is closer to the user's mouth than noise reference microphone 106 during regular use, the level of the user's speech that is captured by primary speech microphone 104 is likely to be greater than the level of the user's speech that is detected by noise reference microphone 106. This, along with the observation that noise sources which are further from the device will produce approximately similar levels on the two microphones, can be exploited to effectively estimate the necessary statistics to calculate filter coefficients for suppressing background noise, as will be described further below in regard to FIG. 3.

It should be noted that primary speech microphone 104 and noise reference microphone 106 are shown positioned on the respective front and back portions of wireless communication device 102 for illustrative purposes only; this is not intended to be limiting. Persons skilled in the relevant art(s) will recognize that primary speech microphone 104 and noise reference microphone 106 can be positioned in any suitable locations on wireless communication device 102.

It should be further noted that a single noise reference microphone 106 is shown in FIG. 2 for illustrative purposes only and is not intended to be limiting. Persons skilled in the relevant art(s) will recognize that wireless communication device 102 can include any reasonable number of reference microphones.

Moreover, primary speech microphone 104 and noise reference microphone 106 are respectively shown in FIGS. 1 and 2 to be included in wireless communication device 102 for illustrative purposes only. It will be recognized by persons skilled in the relevant art(s) that primary speech microphone 104 and noise reference microphone 106 can be implemented in any suitable multi-microphone system or device that operates to process audio signals for transmission, storage and/or playback to a user. For example, primary speech microphone 104 and noise reference microphone 106 can be implemented in a Bluetooth® headset, a hearing aid, a personal recorder, a video recorder, or a sound pick-up system for public speech.

Referring now to FIG. 3, a block diagram of a multi-channel noise suppression system 300 that can be implemented in wireless communication device 102 is illustrated in accordance with an embodiment of the present invention. System 300 is configured to process a primary input speech signal P(m, f) received by primary speech microphone 104 and a reference input speech signal R(m, f) received by noise reference microphone 106 to attenuate or remove background noise from P(m, f). As noted above, both input speech signals P(m, f) and R(m, f), received by the two microphones, contain components of the user's speech (i.e., the desired speech) and background noise. More specifically, P(m, f) contains a desired speech component S₁(m, f) and a background noise component N₁(m, f), and R(m, f) contains a desired speech component S₂(m, f) and a background noise component N₂(m, f). However, because of the position of primary speech microphone 104 and noise reference microphone 106 on wireless communication device 102 relative to the expected position of the desired speech source, the level of the desired speech component S₁(m, f) in P(m, f) is likely to be greater than the level of the desired speech component S₂(m, f) in R(m, f). In addition, there will typically be little difference in level between the background noise components N₁(m, f) and N₂(m, f) of the two input speech signals because the relative distance between each microphone and a background noise source is expected to be about the same in most instances, or at least far more similar than the relative distances between the desired speech source and the two microphones, respectively. Hence, the level difference for a desired speech source will be greater than the level difference for noise sources. This can be used to discriminate between desired and interfering (noise) sources. System 300 is configured to exploit this information to filter P(m, f) using R(m, f) to provide, as output, a noise suppressed primary input speech signal Ŝ₁(m, f).

As shown in FIG. 3, system 300 includes a blocking matrix (BM) 305 and an adaptive noise canceler (ANC) 310. BM 305 is configured to estimate and remove the desired speech component S₂(m, f) in R(m, f) to produce a “cleaner” background noise component N̂₂(m, f). More specifically, BM 305 includes a blocking matrix filter 315 configured to filter P(m, f) to provide an estimate of the desired speech component S₂(m, f) in R(m, f). BM 305 then subtracts the estimated desired speech component Ŝ₂(m, f) from R(m, f) using subtractor 320 to provide, as output, the “cleaner” background noise component N̂₂(m, f).

After N̂₂(m, f) has been obtained, ANC 310 is configured to estimate and remove the undesirable background noise component N₁(m, f) in P(m, f) to provide, as output, the noise suppressed primary input speech signal Ŝ₁(m, f). More specifically, ANC 310 includes an adaptive noise canceler filter 325 configured to filter the “cleaner” background noise component N̂₂(m, f) to provide an estimate of the background noise component N₁(m, f) in P(m, f). ANC 310 then subtracts the estimated background noise component N̂₁(m, f) from P(m, f) using subtractor 330 to provide, as output, the noise suppressed primary input speech signal Ŝ₁(m, f).
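A minimal per-frame sketch of this two-stage structure follows, assuming single-tap frequency domain filters per bin and hypothetical variable names (the closed-form derivations of H and W are given in sections 2.1 and 2.2):

```python
import numpy as np

def suppress_frame(P_m: np.ndarray, R_m: np.ndarray,
                   H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One frame of the FIG. 3 chain (single complex tap per bin).

    P_m, R_m : complex spectra of the primary and reference inputs.
    H, W     : per-bin BM and ANC filter coefficients.
    """
    S2_hat = H * P_m       # blocking matrix filter 315: speech estimate
    N2_hat = R_m - S2_hat  # subtractor 320: "cleaner" background noise
    N1_hat = W * N2_hat    # ANC filter 325: noise estimate in P
    return P_m - N1_hat    # subtractor 330: noise-suppressed output
```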

In an embodiment, and as illustrated in FIG. 3, the primary input speech signal P(m, f) and the reference input speech signal R(m, f) are represented and processed in the frequency domain, on a frame-by-frame basis, by BM 305 and ANC 310, where m indexes the time or a particular frame made up of consecutive time domain samples of the input speech signal and f indexes a particular frequency component or sub-band of the input speech signal. Thus, for example, P(1,10) denotes the complex value of the 10th frequency component or sub-band for the 1st time index or frame of the primary input speech signal P(m, f). The same representation is true, in at least one embodiment, for other signals and signal components illustrated in FIG. 3. It should be noted that in other embodiments the primary input speech signal P(m, f) and the reference input speech signal R(m, f) can be represented and processed in the time domain on a frame-by-frame basis.
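One common way to obtain such a representation is a windowed FFT analysis; the sketch below (with an assumed frame length, hop size, and window, none of which this disclosure specifies) produces an array indexed as X[m, f]:

```python
import numpy as np

def to_frames(x: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Return X[m, f]: the complex spectrum of frame m at bin f."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([np.fft.rfft(window * x[m * hop : m * hop + frame_len])
                     for m in range(n_frames)])
```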

Although system 300 is described above as being implemented in wireless communication device 102 illustrated in FIG. 1, system 300 can be implemented in any suitable multi-microphone system or device that operates to process audio signals for transmission, storage and/or playback to a user. For example, system 300 can be implemented in a Bluetooth® headset, a hearing aid, a personal recorder, a video recorder, or a sound pick-up system for public speech. System 300 can be implemented in hardware using analog and/or digital circuits, in software, through the execution of instructions by one or more general purpose or special-purpose processors, or as a combination of hardware and software.

In the sub-sections that follow, exemplary derivations of closed-form solutions for a frequency domain blocking matrix filter 315 and a hybrid approach blocking matrix filter 315 are described. In addition, exemplary derivations of closed-form solutions for a frequency domain adaptive noise canceler filter 325 and a hybrid approach adaptive noise canceler filter 325 are described.

2.1 The Blocking Matrix

As noted above, BM 305 includes a blocking matrix filter 315 configured to filter the primary input speech signal P(m, f) to provide an estimate of the desired speech component S₂(m, f) in the reference input speech signal R(m, f). BM 305 then subtracts the estimated desired speech component Ŝ₂(m, f) from R(m, f) using subtractor 320 to provide the “cleaner” background noise component N̂₂(m, f).

Ideally, no residual amount of the desired speech component S₂(m, f) is left in the “cleaner” background noise component N̂₂(m, f). However, because of the time-varying nature of the signals processed by BM 305 and the inability of the blocking matrix filter to perfectly model the acoustic channel for the desired speech between the two microphones, often some residual amount of the desired speech component S₂(m, f) will be left in the “cleaner” background noise component N̂₂(m, f). This residual amount of the desired speech component S₂(m, f) can be observed at the output of BM 305 (i.e., based on N̂₂(m, f)) during periods of time (or frames) when mostly desired speech, and little or no background noise, makes up the primary input speech signal P(m, f). If BM 305 is functioning well, the output of BM 305, N̂₂(m, f), should be nearly zero during these periods of time (or frames). The residual amount of the desired speech component S₂(m, f) can be simply expressed as:

$\begin{matrix}\begin{matrix}{{{\hat{N}}_{2}\left( {m,f} \right)} = {{R\left( {m,f} \right)} - {{\hat{S}}_{2}\left( {m,f} \right)}}} \\{= {{R\left( {m,f} \right)} - {{H(f)}{P\left( {m,f} \right)}}}}\end{matrix} & (1)\end{matrix}$

where H(f) is the transfer function of blocking matrix filter 315, m indexes the time or frame, and f indexes a particular frequency component or sub-band.

To achieve the objective of removing the desired speech component S₂(m, f) in the reference input speech signal R(m, f), the transfer function H(f) of blocking matrix filter 315 can be derived (or updated) to substantially minimize the power of the residual signal expressed in Eq. (1) during periods of time (or frames) when the primary input speech signal P(m, f) is predominantly equal to the desired speech signal S₁(m, f). The power of the residual signal, also referred to as a cost function, can be expressed as:

$\begin{matrix}{E_{{\hat{N}}_{2}} = {\sum\limits_{m}{\sum\limits_{f}{{{\hat{N}}_{2}\left( {m,f} \right)}{{\hat{N}}_{2}^{*}\left( {m,f} \right)}}}}} & (2)\end{matrix}$

where ( )* indicates complex conjugate.

In the following sub-sections, a frequency domain blocking matrix filter 315 and a hybrid approach blocking matrix filter 315 are derived (or updated) based on this cost function.

2.1.1 Example Derivation of Frequency Domain Blocking Matrix Filter

The frequency domain blocking matrix filter 315 is derived (or updated) based on a closed-form solution below, assuming a single complex tap per frequency bin. However, persons skilled in the relevant art(s) will recognize based on the teachings herein that the proposed solution can be generalized to multiple taps per bin.

The cost function expressed in Eq. (2) is expanded as:

$$\begin{aligned}
E_{\hat{N}_2} &= \sum_m \sum_f \hat{N}_2(m,f)\,\hat{N}_2^*(m,f) \\
&= \sum_f \sum_m \left(R(m,f) - H(f)\,P(m,f)\right)\left(R(m,f) - H(f)\,P(m,f)\right)^* \\
&= \sum_f \left[\, C_{R,R^*}(f) - H(f)\,C_{P,R^*}(f) - H^*(f)\,C_{R,P^*}(f) + H(f)\,H^*(f)\,C_{P,P^*}(f) \right]
\end{aligned} \tag{3}$$

The gradient of E_N̂₂ with respect to H(f) is calculated from:

$\begin{matrix}{{\nabla_{H}\left( E_{{\hat{N}}_{2}} \right)} = {\frac{\partial E_{{\hat{N}}_{2}}}{{\partial{Re}}\left\{ {H(f)} \right\}} + {j\frac{\partial E_{{\hat{N}}_{2}}}{{\partial{Im}}\left\{ {H(f)} \right\}}}}} & (4)\end{matrix}$

by inserting:

$\begin{matrix}{\frac{\partial E_{{\hat{N}}_{2}}}{{\partial{Re}}\left\{ {H(f)} \right\}} = {{- {C_{P,R^{*}}(f)}} - {C_{R,P^{*}}(f)} + {{H(f)}{C_{P,P^{*}}(f)}} + {{H^{*}(f)}{C_{P,P^{*}}(f)}}}} & (5) \\{\frac{\partial E_{{\hat{N}}_{2}}}{{\partial{Im}}\left\{ {H(f)} \right\}} = {{{- j}\; {C_{P,R^{*}}(f)}} + {j\; {C_{R,P^{*}}(f)}} + {j\; {H^{*}(f)}{C_{P,P^{*}}(f)}} - {j\; {H(f)}{C_{P,P^{*}}(f)}}}} & (6)\end{matrix}$

resulting in:

$\begin{matrix}\begin{matrix}{{\nabla_{H}\left( E_{{\hat{N}}_{2}} \right)} = {{{- 2}{C_{R,P^{*}}(f)}} + {2{H(f)}{C_{P,P^{*}}(f)}}}} \\{= 0} \\\left. \Downarrow \right. \\{{H(f)} = \frac{C_{R,P^{*}}(f)}{C_{P,P^{*}}(f)}}\end{matrix} & (7)\end{matrix}$

where C_(R,P*)(f) and C_(P,P*)(f) represent time-varying statistics derived (or updated) during periods of time (or frames) when the primary input speech signal P(m, f) is predominantly equal to the desired speech signal S₁(m, f). This can be quantified by the energy of the desired speech signal being greater than the energy of the background noise by a significant degree. The statistics can be expressed as:

$\begin{matrix}{\; {{C_{R,P^{*\;}}(f)} = {\sum\limits_{m}{{R\left( {m,f} \right)}{P^{*}\left( {m,f} \right)}}}}} & (8) \\{{C_{P,P^{*}}(f)} = {\sum\limits_{m}{{P\left( {m,f} \right)}{P^{*}\left( {m,f} \right)}}}} & (9)\end{matrix}$

The condition that these statistics be derived (or updated) when the energy of the desired speech is greater than the energy of the background noise in primary input speech signal P(m, f) by a large degree means that reference input speech signal R(m, f) and primary input speech signal P(m, f) generally are dominated by desired speech, and ideally include only desired speech. Thus, the calculation of C_(R,P*)(f) as the sum of products of the reference input speech signal R(m, f) and the complex conjugate of the primary input speech signal P(m, f) at a given frequency bin f for some number of frames can be seen as a way of estimating the cross-spectrum at that frequency bin between the desired speech component in the reference input speech signal R(m, f) and the desired speech component in the primary input speech signal P(m, f). Consequently, C_(R,P*)(f) can be referred to as the cross-channel statistics of the desired speech, or just the desired speech cross-channel statistics.

Similarly, the calculation of C_(P,P*)(f) as the sum of products of the primary input speech signal P(m, f) and its own complex conjugate at a given frequency bin f for some number of frames can be seen as a way of estimating the power spectrum at that frequency bin of the desired speech component in the primary input speech signal P(m, f). Consequently, C_(P,P*)(f) can be referred to as the desired speech statistics of the primary input speech signal.

Collectively, the cross-channel statistics of the desired speech and the desired speech statistics of the primary input speech signal can be referred to as simply the desired speech statistics. Further details and variants on the method of calculating the desired speech statistics are provided below in section 3.

In the embodiment where blocking matrix filter 315 is implemented in the frequency domain by multiplication, statistics estimator 335, illustrated in FIG. 3, is configured to derive (or update) estimates of the statistics C_(R,P*)(f) and C_(P,P*)(f) and provide the estimates to controller 340, also illustrated in FIG. 3. Controller 340 is then configured to use the estimates of the statistics C_(R,P*)(f) and C_(P,P*)(f) to configure blocking matrix filter 315. For example, controller 340 can use these values to configure blocking matrix filter 315 in accordance with the transfer function H(f) expressed in Eq. (7), although this is only one example.
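As an illustrative sketch only (not the disclosed implementation), the statistics of Eqs. (8) and (9) could be tracked with exponential averaging, gated on speech-dominant frames, with the single-tap H(f) of Eq. (7) read out on demand; the class name, the forgetting factor alpha, the boolean gate (which stands in for the adaptation-factor control of section 3), and the floor term are all assumptions:

```python
import numpy as np

class BlockingMatrixStats:
    """Recursive estimates of C_{R,P*}(f) and C_{P,P*}(f)."""

    def __init__(self, n_bins: int, alpha: float = 0.95):
        self.alpha = alpha  # assumed exponential forgetting factor
        self.C_RP = np.zeros(n_bins, dtype=complex)  # Eq. (8) estimate
        self.C_PP = np.zeros(n_bins)                 # Eq. (9) estimate (real)

    def update(self, P_m: np.ndarray, R_m: np.ndarray,
               speech_dominant: bool) -> None:
        # Update only on frames where desired speech dominates the noise.
        if speech_dominant:
            self.C_RP = self.alpha * self.C_RP + \
                (1.0 - self.alpha) * R_m * np.conj(P_m)
            self.C_PP = self.alpha * self.C_PP + \
                (1.0 - self.alpha) * np.abs(P_m) ** 2

    def bm_filter(self, floor: float = 1e-12) -> np.ndarray:
        """Closed-form single-tap H(f) of Eq. (7)."""
        return self.C_RP / (self.C_PP + floor)
```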

2.1.2 Example Derivation of Hybrid Approach Blocking Matrix Filter

A hybrid variation of blocking matrix filter 315 in accordance with an embodiment of the present invention will now be described. The hybrid variation combines the frequency domain approach described above with a time domain approach. This can be a practical solution to performing noise suppression within a sub-band based audio system where an increased frequency resolution is desirable for the noise suppressor. The limited frequency resolution is expanded by applying a low-order time domain solution to individual frequency bins or sub-bands. This also offers the possibility of expanding the frequency resolution based on a psycho-acoustically motivated frequency resolution, e.g., expand low frequency regions more than high frequency regions. As a practical example, one may have a sub-band decomposition with 32 complex sub-bands in 0 to 4 kHz. This provides a spectral resolution of 125 Hz, which may be inadequate. Instead of expanding the spectral resolution of all sub-bands to 32 Hz by a 4th-order noise suppression filter, it may be desirable to expand the low sub-bands by 4, the middle sub-bands by 2, and leave the upper sub-bands at the native resolution.

The hybrid approach changes the “filtering” with the transfer function H(f) from:

$$\hat{S}_2(m,f) = H(f)\,P(m,f) \tag{10}$$

to:

$\begin{matrix}{{{\hat{S}}_{2}\left( {m,f} \right)} = {\sum\limits_{k = 0}^{K}{{H\left( {k,f} \right)}{P\left( {{m - k},f} \right)}}}} & (11)\end{matrix}$

where m indexes the time or frame, f indexes a particular sub-band, and k = 0, 1, …, K indexes the individual filter coefficients for a particular frequency index f, making up the noise suppression time direction filter in that particular frequency bin. Hence, the term time direction filter can be used to refer to the individual noise suppression filters that filter the frequency bins, or sub-band signals, of the primary input speech signal P(m, f) in the time direction.
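For illustration, Eq. (11) amounts to a short FIR filter run along the frame (time) axis of each sub-band; a sketch under assumed array shapes and zero initial conditions:

```python
import numpy as np

def time_direction_filter(P: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Apply Eq. (11): S2_hat[m, f] = sum_k H[k, f] * P[m - k, f].

    P : (n_frames, n_bins) complex sub-band signal.
    H : (K + 1, n_bins) complex time direction filter per bin.
    """
    n_frames = P.shape[0]
    K = H.shape[0] - 1
    S2_hat = np.zeros_like(P)
    for k in range(K + 1):
        # Each tap multiplies the sub-band signal delayed by k frames.
        S2_hat[k:, :] += H[k, :] * P[: n_frames - k, :]
    return S2_hat
```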

The residual signal in Eq. (1) can be rewritten based on Eq. (11) as follows:

$\begin{matrix}{{{\hat{N}}_{2}\left( {m,f} \right)} = {{R\left( {m,f} \right)} - {\sum\limits_{k = 0}^{K}{{H\left( {k,f} \right)}{P\left( {{m - k},f} \right)}}}}} & (12)\end{matrix}$

Substituting Eq. (12) into Eq. (2), the gradient of E_N̂₂ with respect to H(k, f) is calculated as:

$$\begin{aligned}
\nabla_{H(k,f)}\left(E_{\hat{N}_2}\right) &= \frac{\partial E_{\hat{N}_2}}{\partial\,\mathrm{Re}\{H(k,f)\}} + j\,\frac{\partial E_{\hat{N}_2}}{\partial\,\mathrm{Im}\{H(k,f)\}} \\
&= -2\sum_m \hat{N}_2(m,f)\,P^*(m-k,f) \\
&= -2\sum_m \left(R(m,f) - \sum_{l=0}^{K} H(l,f)\,P(m-l,f)\right) P^*(m-k,f) \\
&= 2\sum_{l=0}^{K} H(l,f)\left(\sum_m P(m-l,f)\,P^*(m-k,f)\right) - 2\sum_m R(m,f)\,P^*(m-k,f) \\
&= 0
\end{aligned} \tag{13}$$

The set of K+1 equations (for k = 0, 1, …, K) of Eq. (13) provides a matrix equation for every frequency bin f to solve for H(k, f), where k = 0, 1, …, K:

$$\begin{bmatrix}
\sum_m P(m,f)P^*(m,f) & \sum_m P(m-1,f)P^*(m,f) & \cdots & \sum_m P(m-K,f)P^*(m,f) \\
\sum_m P(m,f)P^*(m-1,f) & \sum_m P(m-1,f)P^*(m-1,f) & \cdots & \sum_m P(m-K,f)P^*(m-1,f) \\
\vdots & \vdots & \ddots & \vdots \\
\sum_m P(m,f)P^*(m-K,f) & \sum_m P(m-1,f)P^*(m-K,f) & \cdots & \sum_m P(m-K,f)P^*(m-K,f)
\end{bmatrix}
\begin{bmatrix} H(0,f) \\ H(1,f) \\ \vdots \\ H(K,f) \end{bmatrix}
=
\begin{bmatrix} \sum_m R(m,f)P^*(m,f) \\ \sum_m R(m,f)P^*(m-1,f) \\ \vdots \\ \sum_m R(m,f)P^*(m-K,f) \end{bmatrix} \tag{14}$$

This solution can be written as:

$$\underline{\underline{R}}_P(f)\cdot\underline{H}(f) = \underline{r}_{R,P^*}(f) \tag{15}$$

where:

$$\underline{\underline{R}}_P(f) = \sum_m \underline{P}^*(m,f)\cdot\underline{P}(m,f)^T \tag{16}$$

$$\underline{r}_{R,P^*}(f) = \sum_m R(m,f)\cdot\underline{P}^*(m,f) \tag{17}$$

$$\underline{P}(m,f) = \begin{bmatrix} P(m,f) \\ P(m-1,f) \\ \vdots \\ P(m-K,f) \end{bmatrix}, \qquad \underline{H}(f) = \begin{bmatrix} H(0,f) \\ H(1,f) \\ \vdots \\ H(K,f) \end{bmatrix} \tag{18}$$

and the superscript T denotes non-conjugate transpose. The solution per frequency bin to the time direction filter is thus given by:

$$\underline{H}(f) = \left(\underline{\underline{R}}_P(f)\right)^{-1}\cdot\underline{r}_{R,P^*}(f) \tag{19}$$

This solution appears to require a matrix inversion, but in most practical applications a matrix inversion is not needed.

In the embodiment where blocking matrix filter 315 is implemented based on the hybrid approach, statistics estimator 335 is configured to derive (or update) estimates of the statistics expressed in Eq. (16) and Eq. (17) and provide the estimates to controller 340. Controller 340 is then configured to use the estimates of the statistics to configure blocking matrix filter 315. For example, controller 340 can use these values to configure blocking matrix filter 315 in accordance with the transfer function H(f) expressed in Eq. (19), although this is only one example.
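A sketch of solving Eq. (14) per bin follows (a general linear solve is used purely for clarity; as noted above, a practical implementation can avoid the explicit inversion, and the diagonal loading term is an assumed regularization, not part of this disclosure):

```python
import numpy as np

def hybrid_bm_filter(R_P: np.ndarray, r_RP: np.ndarray) -> np.ndarray:
    """Solve Eq. (19) per frequency bin.

    R_P  : (n_bins, K+1, K+1) correlation matrices of Eq. (16).
    r_RP : (n_bins, K+1) cross-correlation vectors of Eq. (17).
    Returns H with shape (n_bins, K+1)."""
    loading = 1e-9 * np.eye(R_P.shape[-1])  # keeps the systems well conditioned
    return np.stack([np.linalg.solve(R_P[f] + loading, r_RP[f])
                     for f in range(R_P.shape[0])])
```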

Comparing Eq. (16) and Eq. (17) to Eq. (9) and Eq. (8), respectively, it can be seen that similar statistics are calculated by each set of equations, except that instead of calculating statistics only between current frequency bin components of signals, the hybrid solution requires calculation of statistics between vectors of current and past frequency bin components of signals, i.e., a time dimension is now part of the statistics. At the extreme, with no Discrete Fourier Transform (DFT), i.e., a single full band signal (the time domain signal), the hybrid method becomes a pure time domain method, and hence the solution above also provides the solution for a pure time domain approach. The frequency index would become obsolete (as there is only one frequency band), and the signal vectors in the time direction would contain the signal time domain samples. A further simplification in that case is that the time domain signal without DFT is real, and not complex as in the case of the DFT bins or if a complex sub-band analysis has been applied.

2.1.3 Alternative Approach to Blocking Matrix

As discussed above, to achieve the objective of removing the desired speech component S₂(m, f) in the reference input speech signal R(m, f), the transfer function H(f) of blocking matrix filter 315 can be derived (or updated) to substantially minimize the power of the residual signal, also referred to as a cost function, expressed in Eq. (2) during periods of time (or frames) when the primary input speech signal P(m, f) is predominantly desired speech.

As an alternative method to achieve the objective of removing the desired speech component S₂(m, f) in the reference input speech signal R(m, f), the transfer function H(f) of blocking matrix filter 315 can be derived (or updated) to substantially minimize the power of the difference between the background noise component N₂(m, f) in the reference input speech signal R(m, f) and the output of BM 305, N̂₂(m, f). The power of the difference between the background noise component N₂(m, f) and the output of BM 305, N̂₂(m, f), can be expressed as:

$\begin{matrix}{E_{{\hat{N}}_{2}} = {\sum\limits_{m}{\sum\limits_{f}{\left( {{N_{2}\left( {m,f} \right)} - {{\hat{N}}_{2}\left( {m,f} \right)}} \right)\left( {{N_{2}\left( {m,f} \right)} - {{\hat{N}}_{2}\left( {m,f} \right)}} \right)^{*}}}}} & (20)\end{matrix}$

where ( )* indicates complex conjugate.

Accommodating the hybrid approach, from Eq. (20) the gradient of E_N̂₂ with respect to H(k, f) is calculated as:

$$\begin{aligned}
\nabla_{H(k,f)}\left(E_{\hat{N}_2}\right) &= \frac{\partial E_{\hat{N}_2}}{\partial\,\mathrm{Re}\{H(k,f)\}} + j\,\frac{\partial E_{\hat{N}_2}}{\partial\,\mathrm{Im}\{H(k,f)\}} \\
&= 2\sum_m \left(N_2(m,f) - \hat{N}_2(m,f)\right) P^*(m-k,f) \\
&= 2\sum_m \left(N_2(m,f) - R(m,f) + \sum_{l=0}^{K} H(l,f)\,P(m-l,f)\right) P^*(m-k,f) \\
&= 2\sum_{l=0}^{K} H(l,f)\left(\sum_m P(m-l,f)\,P^*(m-k,f)\right) - 2\left(\sum_m R(m,f)\,P^*(m-k,f)\right) + 2\left(\sum_m N_2(m,f)\,P^*(m-k,f)\right) \\
&= 0
\end{aligned} \tag{21}$$

Using the definitions of sub-section 2.1.2, the solution is given by the following matrix equation:

$$\begin{aligned}
\underline{\underline{R}}_P(f)\cdot\underline{H}(f) &= \underline{r}_{R,P^*}(f) - \underline{r}_{N_2,P^*}(f) \\
&\Downarrow \\
\underline{H}(f) &= \left(\underline{\underline{R}}_P(f)\right)^{-1}\cdot\left(\underline{r}_{R,P^*}(f) - \underline{r}_{N_2,P^*}(f)\right)
\end{aligned} \tag{22}$$

In practice, the estimation of r_(N₂,P*)(f) can be carried out based on a (reasonable) assumption of the desired speech and background noise being independent:

$\begin{matrix}\begin{matrix}{{r_{N_{2},P^{*}}\left( {k,f} \right)} = {\sum\limits_{m}{{N_{2}\left( {m,f} \right)}{P^{*}\left( {{m - k},f} \right)}}}} \\{= {\sum\limits_{m}{{N_{2}\left( {m,f} \right)}\left( {{S_{1}^{*}\left( {{m - k},f} \right)} + {N_{1}^{*}\left( {{m - k},f} \right)}} \right)}}} \\{\approx {\sum\limits_{m}{{N_{2}\left( {m,f} \right)}{N_{1}^{*}\left( {{m - k},f} \right)}}}} \\{= {r_{N_{2},N_{1}^{*}}\left( {k,f} \right)}}\end{matrix} & (23)\end{matrix}$

Hence, Eq. (22) can be simplified to:

$$\underline{H}(f) = \left(\underline{\underline{R}}_P(f)\right)^{-1}\cdot\left(\underline{r}_{R,P^*}(f) - \underline{r}_{N_2,N_1^*}(f)\right) \tag{24}$$

Eq. (24) facilitates updating blocking matrix filter 315 when background noise is present in the environment of primary speech microphone 104 and noise reference microphone 106. This can be beneficial because most environmental background noise is not intermittent like speech, and hence it can be impractical to locate segments of primarily desired speech in primary input speech signal P(m, f) and reference input speech signal R(m, f) for updating the statistics required by the closed-form solution for blocking matrix filter 315. The statistics r_(N₂,N₁*)(f) can be estimated during desired speech absence. From examination of Eq. (24), it is immediately evident that H(f) of Eq. (24) converges to 0 during desired speech absence and to the clean-speech H(f) of Eq. (19) during background noise absence.

From Eq. (24), the solution according to the alternative approach for a single complex tap, K = 0, is easily written as:

$\begin{matrix}{{H(f)} = \frac{{r_{R,P^{*}}(f)} - {r_{N_{2},N_{1}^{*}}(f)}}{R_{P}(f)}} & (25)\end{matrix}$

or, according to the notation of sub-section 2.1.1, as:

$\begin{matrix}{{H(f)} = \frac{{C_{R,P^{*}}(f)} - {C_{N_{2},N_{1}^{*}}(f)}}{C_{P,P^{*}}(f)}} & (26)\end{matrix}$

In this alternative embodiment, statistics estimator 335 is configured to obtain (or update) estimates of the statistics used in the calculations of Eq. (25) and/or Eq. (26) and provide the estimates to controller 340. Controller 340 is then configured to use the estimates to configure blocking matrix filter 315. For example, controller 340 can use these values to configure blocking matrix filter 315 in accordance with the transfer function H(f) expressed in Eq. (25) or (26).
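For the single-tap case, Eq. (26) reduces to a one-line correction of the clean-speech solution; a hedged sketch (the function name and floor term are assumptions):

```python
import numpy as np

def bm_filter_alternative(C_RP: np.ndarray, C_N2N1: np.ndarray,
                          C_PP: np.ndarray, floor: float = 1e-12) -> np.ndarray:
    """Eq. (26): H(f) = (C_{R,P*}(f) - C_{N2,N1*}(f)) / C_{P,P*}(f).

    C_N2N1 would be estimated during desired speech absence, as
    described above; when C_RP equals C_N2N1 the filter goes to zero."""
    return (C_RP - C_N2N1) / (np.real(C_PP) + floor)
```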

2.2 The Adaptive Noise Canceler

As noted above, ANC 310 includes an adaptive noise canceler filter 325 configured to filter the “cleaner” background noise component N̂₂(m, f) to provide an estimate of the background noise component N₁(m, f) in P(m, f). ANC 310 then subtracts the estimated background noise component N̂₁(m, f) from P(m, f) using subtractor 330 to provide, as output, the noise suppressed primary input speech signal Ŝ₁(m, f).

Ideally, no residual amount of the background noise component N₁(m, f) is left in the noise suppressed primary input speech signal Ŝ₁(m, f). However, because of the time-varying nature of the signals processed by ANC 310 and the inability of the ANC filter to perfectly model the real unknown channel, often some residual amount of the background noise component N₁(m, f) will be left in the noise suppressed primary input speech signal Ŝ₁(m, f).

To achieve the objective of removing the background noise component N₁(m, f) in the primary input speech signal P(m, f), the transfer function W(f) of adaptive noise canceler filter 325 can be derived (or updated) to substantially minimize the power of the noise suppressed primary input speech signal Ŝ₁(m, f). In practice, the BM is not perfect in removing all desired speech from N̂₂(m, f), and hence it is wise to bias the minimization of the power of the noise suppressed primary input speech signal Ŝ₁(m, f) to segments of desired speech absence, i.e., noise presence only. The power of the noise suppressed primary input speech signal Ŝ₁(m, f), also referred to as a cost function, can be expressed as:

$\begin{matrix}{E_{{\hat{S}}_{1}} = {\sum\limits_{m}{\sum\limits_{f}{{{\hat{S}}_{1}\left( {m,f} \right)}{{\hat{S}}_{1}^{*}\left( {m,f} \right)}}}}} & (27)\end{matrix}$

where ( )* indicates complex conjugate, m indexes the time or frame, and f indexes a particular frequency component or sub-band.

In the following sub-sections, a frequency domain adaptive noise canceler filter 325 and a hybrid approach adaptive noise canceler filter 325 are derived (or updated) based on the cost function expressed in Eq. (27).

2.2.1 Example Derivation of Frequency Domain Adaptive Noise Canceler

The frequency domain adaptive noise canceler filter 325 is derived (or updated) based on a closed-form solution below, assuming a single complex tap per frequency bin. However, persons skilled in the relevant art(s) will recognize based on the teachings herein that the proposed solution can be generalized to multiple taps per bin.

From FIG. 3:

$$\hat{S}_1(m,f) = P(m,f) - W(f)\,\hat{N}_2(m,f) \tag{28}$$

where, again, W(f) represents the transfer function of adaptive noise canceler filter 325. The gradient of the cost function E_Ŝ₁ expressed in Eq. (27) with respect to the transfer function W(f) of adaptive noise canceler filter 325 is:

$$\begin{aligned}
\nabla_{W(f)}\left(E_{\hat{S}_1}\right) &= \frac{\partial E_{\hat{S}_1}}{\partial\,\mathrm{Re}\{W(f)\}} + j\,\frac{\partial E_{\hat{S}_1}}{\partial\,\mathrm{Im}\{W(f)\}} \\
&= -2\sum_m \hat{S}_1(m,f)\,\hat{N}_2^*(m,f) \\
&= -2\sum_m \left(P(m,f) - W(f)\,\hat{N}_2(m,f)\right)\hat{N}_2^*(m,f) \\
&= 2\,W(f)\left(\sum_m \hat{N}_2(m,f)\,\hat{N}_2^*(m,f)\right) - 2\left(\sum_m P(m,f)\,\hat{N}_2^*(m,f)\right) \\
&= 0
\end{aligned} \tag{29}$$

$$W(f) = \frac{\sum_m P(m,f)\,\hat{N}_2^*(m,f)}{\sum_m \hat{N}_2(m,f)\,\hat{N}_2^*(m,f)} = \frac{C_{P,\hat{N}_2^*}(f)}{C_{\hat{N}_2,\hat{N}_2^*}(f)} \tag{30}$$

where C_(N̂₂,N̂₂*)(f) and C_(P,N̂₂*)(f) represent time-varying statistics that are given by:

$\begin{matrix}{{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{{\hat{N}}_{2}\left( {m,f} \right)}{{\hat{N}}_{2}^{*}\left( {m,f} \right)}}}} & (31) \\{{C_{P,{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{P\left( {m,f} \right)}{{\hat{N}}_{2}^{*}\left( {m,f} \right)}}}} & (32)\end{matrix}$

C_(N̂₂,N̂₂*)(f), expressed in Eq. (31), is given by the sum of products of the “cleaner” background noise component N̂₂(m, f) with its own complex conjugate for some number of frames and is essentially the power spectrum of the “cleaner” background noise at frequency f. C_(N̂₂,N̂₂*)(f) can be referred to as the background noise statistics of the blocking matrix output. C_(P,N̂₂*)(f), expressed in Eq. (32), is given by the sum of products of the primary input speech signal P(m, f) and the complex conjugate of the “cleaner” background noise component N̂₂(m, f) for some number of frames and is essentially the cross-spectrum at frequency f between the two signals. C_(P,N̂₂*)(f) can be referred to as the cross-channel background noise statistics.

Collectively, the background noise statistics of the blocking matrix output and the cross-channel background noise statistics can be referred to as the background noise statistics. Further details and variants on the method of calculating the background noise statistics are provided below in section 3.

If BM 305 is effective (in suppressing the desired speech component S₂(m, f) in the “cleaner” background noise component N̂₂(m, f)), then the statistics expressed in Eq. (31) and Eq. (32) can be updated each time (or nearly each time) a new frame of primary input speech signal P(m, f) and reference input speech signal R(m, f) is received and processed, regardless of the content of the primary input speech signal P(m, f) and the reference input speech signal R(m, f). However, in an alternative embodiment (and in a potentially safer approach), as mentioned above, the statistics of adaptive noise canceler filter 325 can be updated primarily during periods of time or frames when desired speech is absent.

In the embodiment where adaptive noise canceler filter 325 is implemented in the frequency domain as a multiplication, statistics estimator 345, illustrated in FIG. 3, is configured to derive (or update) estimates of the statistics expressed in Eq. (31) and Eq. (32) and provide the estimates to controller 350, also illustrated in FIG. 3. Controller 350 is then configured to use the estimates of the statistics to configure adaptive noise canceler filter 325. For example, controller 350 can use these values to configure adaptive noise canceler filter 325 in accordance with the transfer function W(f) expressed in Eq. (30), although this is only one example.
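Mirroring the blocking matrix sketch above, the ANC statistics of Eqs. (31) and (32) could be tracked as follows (an assumed exponential-averaging variant with a speech-absence gate corresponding to the safer approach just described; none of the names or constants come from this disclosure):

```python
import numpy as np

class AncStats:
    """Recursive estimates of C_{N2,N2*}(f) and C_{P,N2*}(f)."""

    def __init__(self, n_bins: int, alpha: float = 0.9):
        self.alpha = alpha  # assumed forgetting factor
        self.C_NN = np.zeros(n_bins)                 # Eq. (31): noise power
        self.C_PN = np.zeros(n_bins, dtype=complex)  # Eq. (32): cross stats

    def update(self, P_m: np.ndarray, N2_hat_m: np.ndarray,
               speech_absent: bool) -> None:
        # The safer variant: update primarily when desired speech is absent.
        if speech_absent:
            self.C_NN = self.alpha * self.C_NN + \
                (1.0 - self.alpha) * np.abs(N2_hat_m) ** 2
            self.C_PN = self.alpha * self.C_PN + \
                (1.0 - self.alpha) * P_m * np.conj(N2_hat_m)

    def anc_filter(self, floor: float = 1e-12) -> np.ndarray:
        """Closed-form single-tap W(f) of Eq. (30)."""
        return self.C_PN / (self.C_NN + floor)
```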

2.2.2 Example Derivation of Hybrid Approach Adaptive Noise Canceler Filter

A hybrid variation of adaptive noise canceler filter 325 in accordance with an embodiment of the present invention will now be described. The derivation of the hybrid approach follows that of sub-section 2.1.2 for blocking matrix filter 315.

The hybrid approach changes the “filtering” with the transfer function W(f) from:

$$\hat{N}_1(m,f) = W(f)\,\hat{N}_2(m,f) \tag{33}$$

to:

$\begin{matrix}{{{\hat{N}}_{1}\left( {m,f} \right)} = {\sum\limits_{k = 0}^{K}{{W\left( {k,f} \right)}{{\hat{N}}_{2}\left( {{m - k},f} \right)}}}} & (34)\end{matrix}$

where m indexes the time or frame, f indexes a particular sub-band, and k = 0, 1, …, K indexes the individual filter coefficients for a particular frequency bin f, making up the noise suppression time direction filter in that particular frequency bin. Hence, the term time direction filter can be used to refer to the individual noise suppression filters that filter the sub-band signals of the “cleaner” background noise component N̂₂(m, f) in the time direction.

Eq. (28) can be rewritten based on Eq. (34) as follows:

$\begin{matrix}{{{\hat{S}}_{1}\left( {m,f} \right)} = {{P\left( {m,f} \right)} - {\sum\limits_{k = 0}^{K}{{W\left( {k,f} \right)}{{\hat{N}}_{2}\left( {{m - k},f} \right)}}}}} & (35)\end{matrix}$

Substituting Eq. (35) into Eq. (27), the gradient of E_Ŝ₁ with respect to W(k, f) is calculated as:

$$\begin{aligned}
\nabla_{W(k,f)}\left(E_{\hat{S}_1}\right) &= \frac{\partial E_{\hat{S}_1}}{\partial\,\mathrm{Re}\{W(k,f)\}} + j\,\frac{\partial E_{\hat{S}_1}}{\partial\,\mathrm{Im}\{W(k,f)\}} \\
&= -2\sum_m \hat{S}_1(m,f)\,\hat{N}_2^*(m-k,f) \\
&= -2\sum_m \left(P(m,f) - \sum_{l=0}^{K} W(l,f)\,\hat{N}_2(m-l,f)\right)\hat{N}_2^*(m-k,f) \\
&= 2\sum_{l=0}^{K} W(l,f)\left(\sum_m \hat{N}_2(m-l,f)\,\hat{N}_2^*(m-k,f)\right) - 2\left(\sum_m P(m,f)\,\hat{N}_2^*(m-k,f)\right) \\
&= 0
\end{aligned} \tag{36}$$

Eq. (36) is dual to Eq. (13). Similar to sub-section 2.1.2, the set of K+1 equations (for k=0, 1, . . . K) of Eq. (36) provides a matrix equation for every frequency bin f to solve for W(k, f), where k=0, 1, . . . K:

$\begin{matrix}{{\begin{bmatrix}{\sum\limits_{m}{{\hat{N}}_{2}(m,f)}{{\hat{N}}_{2}^{*}(m,f)}} & {\sum\limits_{m}{{\hat{N}}_{2}(m-1,f)}{{\hat{N}}_{2}^{*}(m,f)}} & \ldots & {\sum\limits_{m}{{\hat{N}}_{2}(m-K,f)}{{\hat{N}}_{2}^{*}(m,f)}} \\{\sum\limits_{m}{{\hat{N}}_{2}(m,f)}{{\hat{N}}_{2}^{*}(m-1,f)}} & {\sum\limits_{m}{{\hat{N}}_{2}(m-1,f)}{{\hat{N}}_{2}^{*}(m-1,f)}} & \ldots & {\sum\limits_{m}{{\hat{N}}_{2}(m-K,f)}{{\hat{N}}_{2}^{*}(m-1,f)}} \\\vdots & \vdots & \ddots & \vdots \\{\sum\limits_{m}{{\hat{N}}_{2}(m,f)}{{\hat{N}}_{2}^{*}(m-K,f)}} & {\sum\limits_{m}{{\hat{N}}_{2}(m-1,f)}{{\hat{N}}_{2}^{*}(m-K,f)}} & \ldots & {\sum\limits_{m}{{\hat{N}}_{2}(m-K,f)}{{\hat{N}}_{2}^{*}(m-K,f)}}\end{bmatrix}\begin{bmatrix}{W(0,f)} \\{W(1,f)} \\\vdots \\{W(K,f)}\end{bmatrix}} = \begin{bmatrix}{\sum\limits_{m}{P(m,f)}{{\hat{N}}_{2}^{*}(m,f)}} \\{\sum\limits_{m}{P(m,f)}{{\hat{N}}_{2}^{*}(m-1,f)}} \\\vdots \\{\sum\limits_{m}{P(m,f)}{{\hat{N}}_{2}^{*}(m-K,f)}}\end{bmatrix}} & (37)\end{matrix}$

This matrix equation can be written compactly as:

$\begin{matrix}{{{\underline{\underline{R}}}_{{\hat{N}}_{2}}(f)} \cdot {\underline{W}(f)} = {{\underline{r}}_{P,{\hat{N}}_{2}^{*}}(f)}} & (38)\end{matrix}$

where:

$\begin{matrix}{{{\underline{\underline{R}}}_{{\hat{N}}_{2}}(f)} = {\sum\limits_{m}{{{\hat{\underline{N}}}_{2}^{*}(m,f)} \cdot {{\hat{\underline{N}}}_{2}(m,f)}^{T}}}} & (39) \\{{{\underline{r}}_{P,{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{P(m,f)} \cdot {{\hat{\underline{N}}}_{2}^{*}(m,f)}}}} & (40) \\{{{{\hat{\underline{N}}}_{2}(m,f)} = \begin{bmatrix}{{\hat{N}}_{2}(m,f)} \\{{\hat{N}}_{2}(m-1,f)} \\\vdots \\{{\hat{N}}_{2}(m-K,f)}\end{bmatrix}},\quad{{\underline{W}(f)} = \begin{bmatrix}{W(0,f)} \\{W(1,f)} \\\vdots \\{W(K,f)}\end{bmatrix}}} & (41)\end{matrix}$

and the superscript T denotes non-conjugate transpose. The solution per frequency bin to the time direction filter is thus given by:

$\begin{matrix}{{\underline{W}(f)} = {\left( {{\underline{\underline{R}}}_{{\hat{N}}_{2}}(f)} \right)}^{-1} \cdot {{\underline{r}}_{P,{\hat{N}}_{2}^{*}}(f)}} & (42)\end{matrix}$

This solution appears to require a matrix inversion, but in most practical applications an explicit matrix inversion is not needed: with a 0th order filter (K=0) the solution reduces to a scalar division per frequency bin, and for larger K the linear system of Eq. (37) can be solved directly.

In the embodiment where adaptive noise canceler filter 325 is implemented based on the hybrid approach, statistics estimator 345 is configured to derive (or update) estimates of the statistics expressed in Eq. (39) and Eq. (40) and provide the estimates to controller 350. Controller 350 is then configured to use the estimates of the statistics to configure adaptive noise canceler filter 325. For example, controller 350 can use these values to configure adaptive noise canceler filter 325 in accordance with the transfer function W(f) expressed in Eq. (42), although this is only one example.

Comparing Eq. (39) and Eq. (40) to Eq. (31) and Eq. (32), respectively, it can be seen that similar statistics are calculated by each set of equations, except that instead of calculating statistics only between current frequency bin components of signals, the hybrid solution requires calculation of statistics between vectors of current and past frequency bin components of signals, i.e. a time dimension is now part of the statistics. At the extreme, with no DFT, i.e. a single full band signal (the time domain signal), the hybrid method becomes a pure time domain method, and hence, the solution above provides the solution also for a pure time domain approach. The frequency index would become obsolete (as there is only one frequency band), and the signal vectors in the time direction would contain the signal time domain samples. A further simplification in that case is that the time domain signal without DFT is real, and not complex as in the case of the DFT bins or when a complex sub-band analysis has been applied.
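As one illustration of this closed-form solve, the following numpy sketch assembles the per-bin statistics of Eq. (39) and Eq. (40) over a block of M frames (frames before the block are assumed zero) and solves the normal equations of Eq. (37) without forming an explicit inverse; the function and array names are assumptions for the sketch.

```python
import numpy as np

def solve_hybrid_anc(P, N2_hat, K):
    """Per-bin closed-form solve of Eq. (37) / Eq. (42).

    P, N2_hat : complex arrays of shape (M, F) -- M frames, F frequency bins.
    Returns W of shape (K+1, F), the time direction filter per bin.
    """
    M, F = P.shape
    W = np.zeros((K + 1, F), dtype=complex)
    for f in range(F):
        # Row k holds N2_hat(m - k, f); history before the block is zero.
        N = np.zeros((K + 1, M), dtype=complex)
        for k in range(K + 1):
            N[k, k:] = N2_hat[:M - k, f]
        R = N.conj() @ N.T      # Eq. (39): R[k, l] = sum_m N2(m-l) N2*(m-k)
        r = N.conj() @ P[:, f]  # Eq. (40): r[k]    = sum_m P(m)    N2*(m-k)
        # Eq. (42), assuming R is nonsingular (enough frames accumulated);
        # np.linalg.solve avoids an explicit matrix inverse.
        W[:, f] = np.linalg.solve(R, r)
    return W
```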

2.2.3 Alternative Approach to Adaptive Noise Canceler

As discussed above, to achieve the objective of removing the background noise component N₁(m, f) in the primary input speech signal P(m, f), the transfer function W(f) of adaptive noise canceler filter 325 can be derived (or updated) to substantially minimize the power of the noise suppressed primary input speech signal Ŝ₁(m, f) expressed in Eq. (27) during speech absence.

As an alternative method to achieve the objective of removing the background noise component N₁(m, f) in the primary input speech signal P(m, f), the transfer function W(f) of adaptive noise canceler filter 325 can be derived (or updated) to substantially minimize the power of the difference between the desired speech component S₁(m, f) in the primary input speech signal P(m, f) and the output of ANC 310, Ŝ₁(m, f). The power of the difference between the desired speech component S₁(m, f) and the output of ANC 310, Ŝ₁(m, f), can be expressed as:

$\begin{matrix}{E_{{\hat{S}}_{1}} = {\sum\limits_{m}{\sum\limits_{f}{\left( {{S_{1}\left( {m,f} \right)} - {{\hat{S}}_{1}\left( {m,f} \right)}} \right)\left( {{S_{1}\left( {m,f} \right)} - {{\hat{S}}_{1}\left( {m,f} \right)}} \right)^{*}}}}} & (43)\end{matrix}$

where ( )* indicates complex conjugate.

Accommodating the hybrid approach, from Eq. (43) the gradient of $E_{\hat{S}_{1}}$ with respect to W(k, f) is calculated as:

$\begin{matrix}\begin{aligned}{\nabla_{W(k,f)}\left( E_{{\hat{S}}_{1}} \right)} &= {\frac{\partial E_{{\hat{S}}_{1}}}{\partial\,{Re}\left\{ {W(k,f)} \right\}} + j\frac{\partial E_{{\hat{S}}_{1}}}{\partial\,{Im}\left\{ {W(k,f)} \right\}}} \\ &= {-\sum\limits_{m}\left\lbrack {\left( {{S_{1}^{*}(m,f)} - {{\hat{S}}_{1}^{*}(m,f)}} \right)\frac{\partial{{\hat{S}}_{1}(m,f)}}{\partial\,{Re}\left\{ {W(k,f)} \right\}} + \left( {{S_{1}(m,f)} - {{\hat{S}}_{1}(m,f)}} \right)\frac{\partial{{\hat{S}}_{1}^{*}(m,f)}}{\partial\,{Re}\left\{ {W(k,f)} \right\}}} \right\rbrack} \\ &\quad{-\; j\sum\limits_{m}\left\lbrack {\left( {{S_{1}^{*}(m,f)} - {{\hat{S}}_{1}^{*}(m,f)}} \right)\frac{\partial{{\hat{S}}_{1}(m,f)}}{\partial\,{Im}\left\{ {W(k,f)} \right\}} + \left( {{S_{1}(m,f)} - {{\hat{S}}_{1}(m,f)}} \right)\frac{\partial{{\hat{S}}_{1}^{*}(m,f)}}{\partial\,{Im}\left\{ {W(k,f)} \right\}}} \right\rbrack} \\ &= {\sum\limits_{m}\left\lbrack {\left( {{S_{1}^{*}(m,f)} - {{\hat{S}}_{1}^{*}(m,f)}} \right){{\hat{N}}_{2}(m-k,f)} + \left( {{S_{1}(m,f)} - {{\hat{S}}_{1}(m,f)}} \right){{\hat{N}}_{2}^{*}(m-k,f)}} \right\rbrack} \\ &\quad{+\; j\sum\limits_{m}\left\lbrack {\left( {{S_{1}^{*}(m,f)} - {{\hat{S}}_{1}^{*}(m,f)}} \right) j\,{{\hat{N}}_{2}(m-k,f)} - \left( {{S_{1}(m,f)} - {{\hat{S}}_{1}(m,f)}} \right) j\,{{\hat{N}}_{2}^{*}(m-k,f)}} \right\rbrack} \\ &= {2\sum\limits_{m}\left( {{S_{1}(m,f)} - {{\hat{S}}_{1}(m,f)}} \right){{\hat{N}}_{2}^{*}(m-k,f)}} \\ &= {2\sum\limits_{m}\left( {{S_{1}(m,f)} - {P(m,f)} + \sum\limits_{l = 0}^{K}{W(l,f)}{{\hat{N}}_{2}(m-l,f)}} \right){{\hat{N}}_{2}^{*}(m-k,f)}} \\ &= {2\sum\limits_{l = 0}^{K}{W(l,f)}\left( {\sum\limits_{m}{{\hat{N}}_{2}(m-l,f)}{{\hat{N}}_{2}^{*}(m-k,f)}} \right) + 2\sum\limits_{m}{S_{1}(m,f)}{{\hat{N}}_{2}^{*}(m-k,f)} - 2\sum\limits_{m}{P(m,f)}{{\hat{N}}_{2}^{*}(m-k,f)}} \\ &= 0\end{aligned} & (44)\end{matrix}$

which is written in matrix form as:

$\begin{matrix}{{{\underline{\underline{R}}}_{{\hat{N}}_{2}}(f)} \cdot {\underline{W}(f)} = {{\underline{r}}_{P,{\hat{N}}_{2}^{*}}(f)} - {{\underline{r}}_{S_{1},{\hat{N}}_{2}^{*}}(f)}} & (45)\end{matrix}$

where $\underline{\underline{R}}_{\hat{N}_{2}}(f)$ and $\underline{r}_{P,\hat{N}_{2}^{*}}(f)$ are defined in sub-section 2.2.2. The last component $\underline{r}_{S_{1},\hat{N}_{2}^{*}}(f)$ is given by:

$\begin{matrix}{{{\underline{r}}_{S_{1},{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{S_{1}(m,f)} \cdot {{\hat{\underline{N}}}_{2}^{*}(m,f)}}}} & (46)\end{matrix}$

and depends on the desired speech component S₁(m, f) in the primary input speech signal P(m, f). The desired speech component S₁(m, f) is generally not available independent of the background noise component N₁(m, f) in the primary input speech signal P(m, f). However, $\underline{r}_{S_{1},\hat{N}_{2}^{*}}(f)$ can be calculated based on an assumption of independence between speech and background noise. Given this assumption, Eq. (46) can be expanded as follows:

$\begin{matrix}\begin{aligned}{r_{S_{1},{\hat{N}}_{2}^{*}}(k,f)} &= {\sum\limits_{m}{{S_{1}(m,f)} \cdot {{\hat{N}}_{2}^{*}(m-k,f)}}} \\ &= {\sum\limits_{m}{\left( {{P(m,f)} - {N_{1}(m,f)}} \right) \cdot \left( {{R(m-k,f)} - \sum\limits_{l = 0}^{K}{H(l,f)}{P(m-k-l,f)}} \right)^{*}}} \\ &= {\sum\limits_{m}{{P(m,f)} \cdot \left( {{R(m-k,f)} - \sum\limits_{l = 0}^{K}{H(l,f)}{P(m-k-l,f)}} \right)^{*}}} \\ &\quad{-\; \sum\limits_{m}{{N_{1}(m,f)} \cdot \left( {{N_{2}(m-k,f)} + {S_{2}(m-k,f)}} \right)^{*}}} \\ &\quad{+\; \sum\limits_{m}{{N_{1}(m,f)} \cdot \left( {\sum\limits_{l = 0}^{K}{H(l,f)}\left( {{S_{1}(m-k-l,f)} + {N_{1}(m-k-l,f)}} \right)} \right)^{*}}} \\ &\approx {{r_{P,R^{*}}(k,f)} - {\sum\limits_{l = 0}^{K}{H^{*}(l,f)}{r_{P,P^{*}}(k+l,f)}} - {r_{N_{1},N_{2}^{*}}(k,f)} + {\sum\limits_{l = 0}^{K}{H^{*}(l,f)}{r_{N_{1},N_{1}^{*}}(k+l,f)}}} \\ &= {{r_{P,R^{*}}(k,f)} - {r_{N_{1},N_{2}^{*}}(k,f)} - {\sum\limits_{l = 0}^{K}{H^{*}(l,f)}\left( {{r_{P,P^{*}}(k+l,f)} - {r_{N_{1},N_{1}^{*}}(k+l,f)}} \right)}}\end{aligned} & (47)\end{matrix}$

For the general hybrid version, the solution is given by:

$\begin{matrix}{{\underline{W}(f)} = {\left( {{\underline{\underline{R}}}_{{\hat{N}}_{2}}(f)} \right)}^{-1} \cdot \left( {{{\underline{r}}_{P,{\hat{N}}_{2}^{*}}(f)} - {{\underline{r}}_{S_{1},{\hat{N}}_{2}^{*}}(f)}} \right)} & (48)\end{matrix}$

and the special 0^(th) order hybrid (non-hybrid, both BM and ANC) version has the following solution:

$\begin{matrix}{{W(f)} = \frac{\begin{matrix}{{r_{P,{\hat{N}}_{2}^{*}}\left( {0,f} \right)} + {r_{N_{1},N_{2}^{*}}\left( {0,f} \right)} - {r_{P,R^{*}}\left( {0,f} \right)} +} \\{{H^{*}(f)}\left( {{r_{P,P^{*}}\left( {0,f} \right)} - {r_{N_{1},N_{1}^{*}}\left( {0,f} \right)}} \right)}\end{matrix}}{r_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}\left( {0,f} \right)}} & (49)\end{matrix}$

With a hybrid BM and non-hybrid ANC, the solution is given by:

$\begin{matrix}{{W(f)} = \frac{\begin{matrix}{{r_{P,{\hat{N}}_{2}^{*}}\left( {0,f} \right)} + {r_{N_{1},N_{2}^{*}}\left( {0,f} \right)} - {r_{P,R^{*}}\left( {0,f} \right)} +} \\{\sum\limits_{k = 0}^{K}{{H^{*}\left( {k,f} \right)}\left( {{r_{P,P^{*}}\left( {k,f} \right)} - {r_{N_{1},N_{1}^{*}}\left( {k,f} \right)}} \right)}}\end{matrix}}{r_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}\left( {0,f} \right)}} & (50)\end{matrix}$

In this alternative approach, statistics estimator 345 is configured to derive (or update) estimates of the statistics expressed in Eq. (39) and/or Eq. (40) and/or Eq. (47) and provide the estimates to controller 350. Controller 350 is then configured to use the estimates of the statistics to configure adaptive noise canceler filter 325. For example, controller 350 can use these values to configure adaptive noise canceler filter 325 in accordance with the transfer function W(f) expressed in Eq. (48), Eq. (49), or Eq. (50).
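As a small illustration of how controller 350 might evaluate the non-hybrid closed-form solution, the sketch below computes Eq. (49) per frequency bin from already-estimated lag-0 statistics; the argument names are assumptions chosen to mirror the notation of Eq. (49).

```python
import numpy as np

def anc_w_eq49(r_P_N2hat, r_N1_N2, r_P_R, r_P_P, r_N1_N1, H, r_N2hat_N2hat):
    """Eq. (49): 0th-order (non-hybrid BM and ANC) alternative-approach
    filter W(f), evaluated for all bins at once.

    Every argument is a complex array of shape (F,) holding the lag-0
    statistic of the same name in Eq. (49); H is the BM filter H(f).
    """
    numerator = r_P_N2hat + r_N1_N2 - r_P_R + np.conj(H) * (r_P_P - r_N1_N1)
    return numerator / r_N2hat_N2hat
```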

3. Estimation of Time-Varying Statistics

As described above in sub-sections 2.1 and 2.2, the closed-form solutions for blocking matrix filter 315 and adaptive noise canceler filter 325 require various statistics to be estimated. In practice, these statistics need to be estimated from the primary input speech signal P(m, f) and the reference input speech signal R(m, f) that contain desired speech mixed with background noise. The statistics will generally vary with time due to, for example, the position of the desired speech source relative to primary speech microphone 104 and noise reference microphone 106 changing, the position of the background noise source(s) relative to primary speech microphone 104 and noise reference microphone 106 changing, etc. The present section describes methods and features that facilitate the estimation of the time-varying statistics used to solve the closed-form solutions for blocking matrix filter 315 and adaptive noise canceler filter 325 described above in sub-sections 2.1 and 2.2.

3.1 Estimation of Time-Varying Statistics for the Blocking Matrix Filter

As described above in sub-section 2.1.1, deriving (or updating) blocking matrix filter 315 requires knowledge of the statistics C_(R,P*)(f) and C_(P,P*)(f), which can be calculated during periods of time (or frames) of predominantly desired speech. The statistics were expressed generally in Eq. (8) and Eq. (9), reproduced below:

$\begin{matrix}{{C_{R,P^{*}}(f)} = {\sum\limits_{m}{{R(m,f)}{P^{*}(m,f)}}}} & (8) \\{{C_{P,P^{*}}(f)} = {\sum\limits_{m}{{P(m,f)}{P^{*}(m,f)}}}} & (9)\end{matrix}$

The condition that these statistics be calculated during predominantly desired speech can be quantified as updating only when the energy of the desired speech is greater than the energy of the background noise in the primary input speech signal P(m, f) by a large degree. In that case, the reference input speech signal R(m, f) and the primary input speech signal P(m, f) generally include primarily desired speech. Thus, the calculation of C_(R,P*)(f) as the sum of products of the reference input speech signal R(m, f) and the complex conjugate of the primary input speech signal P(m, f) at a given frequency bin f for some number of frames can be seen as a way of estimating the cross-spectrum at that frequency bin between the desired speech component in the reference input speech signal R(m, f) and the desired speech component in the primary input speech signal P(m, f). Consequently, and as noted above, C_(R,P*)(f) can be referred to as the cross-channel statistics of the desired speech, or just the desired speech cross-channel statistics.

Similarly, the calculation of C_(P,P*)(f) as the sum of products of the primary input speech signal P(m, f) and its own complex conjugate at a given frequency bin f for some number of frames can be seen as a way of estimating the power spectrum at that frequency bin of the desired speech component in the primary input speech signal P(m, f). Consequently, and as noted above, C_(P,P*)(f) can be referred to as the desired speech statistics of the primary input speech signal.

Collectively, the cross-channel statistics of the desired speech and the desired speech statistics of the primary input speech signal can be referred to as simply the desired speech statistics.

To accommodate the time varying nature of C_(R,P*)(f) and C_(P,P*)(f) expressed in Eq. (8) and Eq. (9), these statistics can be estimated using a time window (as is done in Eq. (8) and Eq. (9)) or using a moving average. The calculation of the statistics using a moving average can be expressed as:

$\begin{matrix}{{C_{R,P^{*}}(m,f)} = {{\alpha(m)} \cdot {C_{R,P^{*}}(m-1,f)}} + {{\left( {1 - \alpha(m)} \right)} \cdot {R(m,f)}{P^{*}(m,f)}}} & (51) \\{{C_{P,P^{*}}(m,f)} = {{\alpha(m)} \cdot {C_{P,P^{*}}(m-1,f)}} + {{\left( {1 - \alpha(m)} \right)} \cdot {P(m,f)}{P^{*}(m,f)}}} & (52)\end{matrix}$

where ( )* indicates complex conjugate, m indexes the time or frame, f indexes a particular frequency component, bin, or sub-band, and α(m) is an adaptation factor, which itself is time-varying.

It should be noted that the moving averages expressed in Eq. (51) and Eq. (52), commonly referred to as exponential moving averaging or exponentially weighted moving averaging, are provided for exemplary purposes only and are not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other moving average expressions can be used.

The adaptation factor α(m) is adjusted in time such that it has a smaller value that is less than one and greater than zero as the likelihood of predominantly desired speech increases, and a comparatively larger value that is closer to one as the likelihood of predominantly desired speech decreases. In practice, this can be achieved by adjusting α(m) to a smaller value when the energy of the desired speech is likely greater than the energy of the background noise in a current frame of the primary input speech signal P(m, f) by a large degree (resulting in C_(R,P*)(f) and C_(P,P*)(f) being updated quickly), and adjusting α(m) to a comparatively large value (e.g., a value around 1) when the energy of the desired speech is not likely to be greater than the energy of the background noise in the current frame of the primary input speech signal P(m, f) by a large degree (resulting in C_(R,P*)(f) and C_(P,P*)(f) being updated slowly, or not at all when α(m) is equal to one).

The adaptation factor α(m) can be determined, for example, based on a difference in energy between a current frame of the primary input speech signal P(m, f) received by primary speech microphone 104 and a current frame of the reference input speech signal R(m, f) received by noise reference microphone 106. In at least one example, the difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal from the log-energy of the current frame of the primary input speech signal.

For instance, if the difference in energy is 16 dB or higher (indicating likelihood of desired speech dominating any background noise present in the current frame of the primary input speech signal P(m, f)), α(m) can be set equal to a smaller value and, if the difference in energy is 6 dB or less (indicating likelihood of background noise dominating any desired speech present in the current frame of the primary input speech signal P(m, f)), α(m) can be set equal to a comparatively larger value, while a piecewise linear mapping from difference in energy to α(m) can be used in-between these two values. In general, the piecewise linear mapping can be monotonically decreasing in-between the two points.

An example piecewise linear mapping 400 from difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to adaptation factor α(m) is illustrated in FIG. 4. It should be noted that piecewise linear mapping 400 is provided for illustrative purposes only and is not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other mappings are possible. For example, a non-linear piecewise mapping can be used.

Using a mapping from difference in energy to α(m) as described above generally means that the statistics expressed in Eq. (51) and Eq. (52) will be updated at a rate directly related to the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f).
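The following sketch is one way to realize such a mapping; the 6 dB and 16 dB break points come from the example above, while the saturation values of α(m) are placeholders that a real system would tune.

```python
import numpy as np

def alpha_from_energy_diff(diff_db, lo_db=6.0, hi_db=16.0,
                           alpha_slow=0.99, alpha_fast=0.9):
    """Monotonically decreasing piecewise linear mapping in the spirit of
    FIG. 4: below lo_db return alpha_slow (statistics nearly frozen),
    above hi_db return alpha_fast (statistics updated quickly), and
    interpolate linearly in-between."""
    t = np.clip((diff_db - lo_db) / (hi_db - lo_db), 0.0, 1.0)
    return alpha_slow + t * (alpha_fast - alpha_slow)
```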

FIG. 5 depicts a flowchart 500 of a method for estimating the time-varying statistics of blocking matrix filter 315, illustrated in FIG. 3, in accordance with an embodiment of the present invention. The method of flowchart 500 can be performed, for example and without limitation, by statistics estimator 335 as described above in reference to FIG. 3. However, the method is not limited to that implementation.

As shown in FIG. 5, the method of flowchart 500 begins at step 505 and immediately transitions to step 510. At step 510, a current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) are received.

At step 515, a difference in energy between the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is calculated. For example, the difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal R(m, f) from the log-energy of the current frame of the primary input speech signal P(m, f).

At step 520, the adaptation factor α(m) is determined based on at least the difference in energy calculated at step 515. For example, the adaptation factor α(m) can be determined based on a piecewise linear mapping from the difference in energy calculated at step 515 to α(m). FIG. 4 illustrates one possible piecewise linear mapping 400, although other mappings, including non-linear mappings, can be used to determine the adaptation factor α(m).

It should be noted that information other than the difference in energy calculated at step 515 can be used to determine the adaptation factor α(m). For example, a voice activity indicator provided by a voice activity detector (not shown) can be used in combination with the difference in energy calculated at step 515 to determine the adaptation factor α(m).

At step 525, the statistics used to determine blocking matrix filter 315 are updated based on the previous values of the statistics, the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor α(m). For example, the cross-channel statistics of the desired speech C_(R,P*)(m, f) can be updated according to Eq. (51) above using the previous value of the cross-channel statistics of the desired speech C_(R,P*)(m−1, f), the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor α(m). Similarly, the desired speech statistics of the primary input speech signal C_(P,P*)(m, f) can be updated according to Eq. (52) above using the previous value of the desired speech statistics of the primary input speech signal C_(P,P*)(m−1, f), the current frame of the primary input speech signal P(m, f), and the adaptation factor α(m).
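Pulling steps 510 through 525 together, a compact sketch of one iteration of flowchart 500 might look as follows; the frame-level log-energies and the mapping function from the previous sketch are assumptions, and a real implementation would also fold in a voice activity indicator as noted above.

```python
import numpy as np

def bm_statistics_step(C_RP, C_PP, P, R):
    """One pass of flowchart 500 for a single frame.

    C_RP, C_PP : complex arrays, shape (F,), previous statistics.
    P, R       : complex spectra of the current frame, shape (F,).
    Returns the updated statistics per Eq. (51) and Eq. (52).
    """
    # Step 515: log-energy difference between the two input signals.
    diff_db = (10.0 * np.log10(np.sum(np.abs(P) ** 2))
               - 10.0 * np.log10(np.sum(np.abs(R) ** 2)))
    # Step 520: map the difference to the adaptation factor (see above).
    alpha = alpha_from_energy_diff(diff_db)
    # Step 525: exponential moving average updates.
    C_RP = alpha * C_RP + (1.0 - alpha) * R * np.conj(P)   # Eq. (51)
    C_PP = alpha * C_PP + (1.0 - alpha) * P * np.conj(P)   # Eq. (52)
    return C_RP, C_PP
```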

3.1.1 Improved Estimation of Clean Speech Statistics

If there are plenty of frames where the desired speech dominates the background noise in the primary input speech signal P(m, f), then even if there is some background noise, the statistics C_(R,P*)(f) and C_(P,P*)(f) expressed by Eq. (51) and Eq. (52), respectively, can be estimated directly from the primary input speech signal P(m, f) and the reference input speech signal R(m, f) with sufficient accuracy. However, to gain robustness to higher levels of background noise, it may be advantageous to estimate the statistics C_(R,P*)(f) and C_(P,P*)(f) in a more advanced manner. For example, the statistics of the stationary portion of the background noise components N₁(m, f) and N₂(m, f) can be further estimated and removed when estimating the statistics C_(R,P*)(f) and C_(P,P*)(f) as follows:

$\begin{matrix}{{C_{R,P^{*}}(m,f)} = {{\alpha(m)} \cdot {C_{R,P^{*}}(m-1,f)}} + {{\left( {1 - \alpha(m)} \right)} \cdot \left\lbrack {{R(m,f)}{P^{*}(m,f)} - {C_{N_{2},N_{1}^{*}}^{stationary}(m,f)}} \right\rbrack}} & (53) \\{{C_{P,P^{*}}(m,f)} = {{\alpha(m)} \cdot {C_{P,P^{*}}(m-1,f)}} + {{\left( {1 - \alpha(m)} \right)} \cdot \left\lbrack {{P(m,f)}{P^{*}(m,f)} - {C_{N_{1},N_{1}^{*}}^{stationary}(m,f)}} \right\rbrack}} & (54)\end{matrix}$

where C_(N₂,N₁*)^(stationary)(m, f) is the cross-channel statistics of the stationary background noise, or just the stationary background noise cross-channel statistics, determined based on the product of the background noise component N₂(m, f) and the complex conjugate of N₁(m, f) at a given frequency bin f, and C_(N₁,N₁*)^(stationary)(m, f) is the stationary background noise statistics of the primary input speech signal determined based on the product of the background noise component N₁(m, f) and its own complex conjugate at a given frequency bin f. Collectively, the cross-channel statistics of the stationary background noise and the stationary background noise statistics of the primary input speech signal can be referred to as simply the stationary background noise statistics.

More specifically, the statistics C_(N₂,N₁*)^(stationary)(m, f) and C_(N₁,N₁*)^(stationary)(m, f) can be estimated from a moving average of input statistics as follows:

$\begin{matrix}{{C_{N_{2},N_{1}^{*}}^{stationary}(m,f)} = {{\alpha_{S}(m)} \cdot {C_{N_{2},N_{1}^{*}}^{stationary}(m-1,f)}} + {{\left( {1 - \alpha_{S}(m)} \right)} \cdot \left\lbrack {{R(m,f)}{P^{*}(m,f)}} \right\rbrack}} & (55) \\{{C_{N_{1},N_{1}^{*}}^{stationary}(m,f)} = {{\alpha_{S}(m)} \cdot {C_{N_{1},N_{1}^{*}}^{stationary}(m-1,f)}} + {{\left( {1 - \alpha_{S}(m)} \right)} \cdot \left\lbrack {{P(m,f)}{P^{*}(m,f)}} \right\rbrack}} & (56)\end{matrix}$

where α_(S)(m) is an adaptation factor.

It should be noted that the moving averages expressed in Eq. (55) and Eq. (56), commonly referred to as exponential moving averaging, are provided for exemplary purposes only and are not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other moving average expressions can be used.

The adaptation factor α_(S)(m) can be determined, for example, based on a difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f). For instance, if the difference in energy is −3 dB or less (indicating likelihood of background noise dominating any desired speech in the current frame of the primary input speech signal P(m, f)), α_(S)(m) can be set equal to a small value between zero and one and, if the difference in energy is 6 dB or higher (indicating likelihood of desired speech dominating any background noise present in the primary input speech signal P(m, f)), α_(S)(m) can be set equal to a comparatively larger value close to one (or exactly equal to one), while a piecewise linear mapping from difference in energy to α_(S)(m) can be used in-between these two values. In general, the piecewise linear mapping can be monotonically increasing in-between the two points.

An example piecewise linear mapping 600 from difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to adaptation factor α_(S)(m) is illustrated in FIG. 6. Compared to the mapping for α(m) above, it can be seen that different points are used to suggest certain likelihood of speech and noise. Such differences are generally present due to a desire to bias/err in certain directions depending on the usage of the information. It should be noted that piecewise linear mapping 600 is provided for illustrative purposes only and is not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other mappings are possible. For example, a non-linear piecewise mapping can be used.

Using a mapping from difference in energy to α_(S)(m) as described above generally means that the statistics expressed in Eq. (55) and Eq. (56) will be updated at a rate inversely related to the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f).
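A sketch of the noise-compensated updates of this sub-section is given below; the increasing mapping for α_(S)(m) uses the −3 dB and 6 dB break points from the text, while its saturation values are illustrative assumptions.

```python
import numpy as np

def alpha_s_from_energy_diff(diff_db, lo_db=-3.0, hi_db=6.0,
                             alpha_min=0.95, alpha_max=1.0):
    """Monotonically increasing piecewise linear mapping in the spirit of
    FIG. 6 (saturation values are placeholders)."""
    t = np.clip((diff_db - lo_db) / (hi_db - lo_db), 0.0, 1.0)
    return alpha_min + t * (alpha_max - alpha_min)

def stationary_noise_step(C_N2N1, C_N1N1, P, R, alpha_s):
    """Eq. (55) and Eq. (56) for one frame; all arrays have shape (F,)."""
    C_N2N1 = alpha_s * C_N2N1 + (1.0 - alpha_s) * R * np.conj(P)  # Eq. (55)
    C_N1N1 = alpha_s * C_N1N1 + (1.0 - alpha_s) * P * np.conj(P)  # Eq. (56)
    return C_N2N1, C_N1N1

def bm_statistics_step_compensated(C_RP, C_PP, C_N2N1, C_N1N1, P, R, alpha):
    """Eq. (53) and Eq. (54): desired speech statistics with the stationary
    background noise statistics removed from the instantaneous products."""
    C_RP = alpha * C_RP + (1.0 - alpha) * (R * np.conj(P) - C_N2N1)  # Eq. (53)
    C_PP = alpha * C_PP + (1.0 - alpha) * (P * np.conj(P) - C_N1N1)  # Eq. (54)
    return C_RP, C_PP
```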

FIG. 7 depicts a flowchart 700 of a method for estimating the time-varying stationary background noise statistics in accordance with an embodiment of the present invention. The method of flowchart 700 can be performed, for example and without limitation, by statistics estimator 335 as described above in reference to FIG. 3. However, the method is not limited to that implementation.

As shown in FIG. 7, the method of flowchart 700 begins at step 705 and immediately transitions to step 710. At step 710, a current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) are received.

At step 715, a difference in energy between the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is calculated. For example, the difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal R(m, f) from the log-energy of the current frame of the primary input speech signal P(m, f).

At step 720, the adaptation factor α_(S)(m) is determined based on at least the difference in energy calculated at step 715. For example, the adaptation factor α_(S)(m) can be determined based on a piecewise linear mapping from the difference in energy calculated at step 715 to α_(S)(m). FIG. 6 illustrates one possible piecewise linear mapping 600, although other mappings, including non-linear mappings, can be used to determine the adaptation factor α_(S)(m).

It should be noted that information other than the difference in energy calculated at step 715 can be used to determine the adaptation factor α_(S)(m). For example, a voice activity indicator provided by a voice activity detector (not shown) can be used in combination with the difference in energy calculated at step 715 to determine the adaptation factor α_(S)(m).

At step 725, the stationary background noise statistics are updated based on the previous values of the stationary background noise statistics, the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor α_(S)(m). For example, the stationary background noise cross-channel statistics C_(N₂,N₁*)^(stationary)(m, f) can be updated according to Eq. (55) above using the previous value of the stationary background noise cross-channel statistics C_(N₂,N₁*)^(stationary)(m−1, f), the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f), and the adaptation factor α_(S)(m). Similarly, the stationary background noise statistics of the primary input speech signal C_(N₁,N₁*)^(stationary)(m, f) can be updated according to Eq. (56) above using the previous value of the stationary background noise statistics of the primary input speech signal C_(N₁,N₁*)^(stationary)(m−1, f), the current frame of the primary input speech signal P(m, f), and the adaptation factor α_(S)(m).

3.1.2 Local Variations in Microphone Levels due to Acoustic Factors

In operation of multi-channel noise suppression system 300 illustrated in FIG. 3, it is possible for one or both of primary speech microphone 104 and noise reference microphone 106 to become shielded for a temporary amount of time. For example, a finger or hair can partially shield primary speech microphone 104 or noise reference microphone 106 for some indeterminate period of time. As a result, the energy of the input speech signal received by the shielded microphone may be below the energy of the input speech signal that would otherwise have been received if it were not shielded. This variation can undermine the effectiveness of using the difference in energy between the primary input speech signal P(m, f) received by primary speech microphone 104 and the reference input speech signal R(m, f) received by noise reference microphone 106 to determine the adaptation factors and time-varying statistics as discussed above in the preceding sub-sections. Therefore, it can be beneficial to take this variation into account.

In one potential solution to take this variation into account, local variations in the level of primary speech microphone 104 and noise reference microphone 106 due to acoustical factors can be respectively calculated based on the following moving averages:

$\begin{matrix}{{M_{P}^{lev}(m)} = {{\alpha_{S}} \cdot {M_{P}^{lev}(m-1)}} + {{\left( {1 - \alpha_{S}} \right)} \cdot {M_{P}(m)}}} & (57) \\{{M_{R}^{lev}(m)} = {{\alpha_{S}} \cdot {M_{R}^{lev}(m-1)}} + {{\left( {1 - \alpha_{S}} \right)} \cdot {M_{R}(m)}}} & (58)\end{matrix}$

where α_(S) is determined based on the piecewise linear mapping in FIG. 6, and M_(P)(m) and M_(R)(m) respectively represent the energies or levels of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) and are given by:

$\begin{matrix}{{M_{P}(m)}\text{?}{10 \cdot {\log_{10}\left( {\sum\limits_{f}{{P\left( {m,f} \right)}}^{2}} \right)}}} & (59) \\{{{M_{R}(m)} = {10 \cdot {\log_{10}\left( {\sum\limits_{f}{{R\left( {m,f} \right)}}^{2}} \right)}}}{\text{?}\text{indicates text missing or illegible when filed}}} & (60)\end{matrix}$

The difference between the moving averages expressed in Eq. (57) and Eq. (58) can then be used to compensate for any variation in the microphone input levels due to acoustical factors. For example, the function used to map the difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to the adaptation factor α(m) can be offset by the difference between the moving averages expressed in Eq. (57) and Eq. (58) to provide compensation. Assuming the mapping function illustrated in the plot of FIG. 4 is used, the offset can be seen as a shift of each point (either left or right) in the plot by the estimated level difference.
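One possible realization of Eq. (57) through Eq. (60) is sketched below; the returned offset is the quantity by which the FIG. 4 mapping would be shifted, and the function name is an assumption.

```python
import numpy as np

def update_mic_level_estimates(M_P_lev, M_R_lev, P, R, alpha_s):
    """Track per-microphone levels (Eq. (57)-(60)) for one frame.

    P, R : complex spectra of the current frame, shape (F,).
    Returns the updated level estimates and the level offset in dB.
    """
    M_P = 10.0 * np.log10(np.sum(np.abs(P) ** 2))         # Eq. (59)
    M_R = 10.0 * np.log10(np.sum(np.abs(R) ** 2))         # Eq. (60)
    M_P_lev = alpha_s * M_P_lev + (1.0 - alpha_s) * M_P   # Eq. (57)
    M_R_lev = alpha_s * M_R_lev + (1.0 - alpha_s) * M_R   # Eq. (58)
    # The difference shifts the alpha(m) mapping of FIG. 4.
    return M_P_lev, M_R_lev, M_P_lev - M_R_lev
```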

3.1.3 Accommodating Changes in Acoustic Coupling Specific to Primary Speech

In operation of multi-channel noise suppression system 300 illustrated in FIG. 3, it is further possible for the desired speech source to move relative to primary speech microphone 104 and noise reference microphone 106, thereby changing the acoustic coupling between the desired speech source and the two microphones. For instance, in the example where multi-channel noise suppression system 300 is implemented in wireless communication device 102, illustrated in FIG. 1, a user can make minor adjustments to the position of wireless communication device 102 during a call, such as by moving the wireless communication device 102 closer or farther away from his or her mouth. These adjustments in position can significantly change the acoustic coupling between the user's mouth and the two microphones. As a result, the energy of the desired speech component within the input speech signals received by the two microphones may be increased or reduced artificially based on the change in position. This variation in the energy of the desired speech component received can undermine the effectiveness of using the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to determine the adaptation factors and time-varying statistics as discussed above in the preceding sub-sections. Therefore, it may be beneficial to take this potential variation into account.

In one potential solution to take this potential variation into account, a moving average is maintained of the difference in energy of a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) and compared to a reference value. More specifically, the moving average is updated based on the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) if the frame of the primary input speech signal P(m, f) is indicated as including desired speech. The degree to which the moving average is updated based on each frame can be controlled using a smoothing factor. For example, the smoothing factor can be set to a value that updates the moving average to be equal to 0.99 of the previous moving average value and 0.01 of the difference in energy of the current frame of the primary input speech signal P(m, f) and the current frame of the reference input speech signal R(m, f), assuming the current frame of the primary input speech signal P(m, f) is indicated as including desired speech.

The reference value, to which the moving average is compared, can be determined as a typical difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) for desired speech when the desired speech source is in its nominal (i.e., intended) position relative to the two microphones.

As an example of this feature, if the user's mouth is in its nominal position relative to the two microphones of wireless communication device 102 during a call, the presence of desired speech may be highly likely if the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is above 10 dB. On the other hand, if the user's mouth is not in its nominal position relative to the two microphones of wireless communication device 102 during a call (e.g., the user's mouth is farther away from at least primary speech microphone 104), then the presence of desired speech may be highly likely if the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is above 6 dB. Thus, there is an effective loss in coupling of 4 dB for the desired speech because of the mismatch in the position of the user's mouth during the call from its nominal position relative to the two microphones. It should be noted that although the coupling for desired speech was reduced by 4 dB by moving the handset into a suboptimal position, the coupling for noise sources remains about the same (as they are far-field to the device for all practical purposes). Hence, this change in coupling only applies to desired speech.

By keeping track of a moving average of the difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) for desired speech as discussed above, and comparing the moving average to a reference value as further discussed above, the effective loss due to suboptimal acoustic coupling for the desired speech can be estimated. This estimated effective loss can then be used to compensate for any actual loss due to suboptimal acoustic coupling for the desired speech. For example, the function used to map the difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to the adaptation factor α(m) can be offset by the estimated effective loss to provide compensation. Assuming the mapping function illustrated in the plot of FIG. 4 is used, the offset can be seen as a shift of each point (either left or right) in the plot by the estimated effective loss.
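The coupling-loss tracking described here can be sketched as follows; the 0.99/0.01 smoothing and the 10 dB nominal reference come from the examples in the text, and the speech presence flag is assumed to be produced by one of the methods described in the next paragraph.

```python
def update_coupling_loss(avg_diff_db, diff_db, speech_present,
                         reference_db=10.0, smoothing=0.99):
    """Track the effective loss in acoustic coupling for desired speech.

    avg_diff_db    : moving average of the P-minus-R energy difference over
                     frames indicated as containing desired speech.
    diff_db        : energy difference of the current frame in dB.
    speech_present : flag from a desired speech presence test.
    Returns the updated moving average and the estimated effective loss
    in dB (positive when the coupling is worse than nominal).
    """
    if speech_present:
        avg_diff_db = smoothing * avg_diff_db + (1.0 - smoothing) * diff_db
    effective_loss_db = reference_db - avg_diff_db
    return avg_diff_db, effective_loss_db
```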

In order to update the moving average based on the difference in energy of a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) when desired speech is indicated to be present in the frame of the primary input speech signal P(m, f), it is obviously necessary to first identify the presence of desired speech. This can be done using several methods. For example, the presence of desired speech can be determined based on whether: (1) an SNR of the primary input speech signal P(m, f) is above a certain threshold; (2) a difference in energy of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is above a certain threshold; and/or (3) a prediction gain of the reference input speech signal R(m, f) from the primary input speech signal P(m, f) using a blocking matrix with a null forced in the direction of the expected desired speech is above a certain threshold. In one embodiment, at least two of these methods are used to determine the presence of desired speech in a frame of the primary input speech signal P(m, f).

3.2 Estimation of Time-Varying Statistics for the Adaptive Noise Canceler

As described above in sub-section 2.2.1, deriving (or updating) adaptive noise canceler filter 325 requires knowledge of the statistics C_(N̂₂,N̂₂*)(f) and C_(P,N̂₂*)(f). The statistics were expressed generally in Eq. (31) and Eq. (32), reproduced below:

$\begin{matrix}{{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{{\hat{N}}_{2}\left( {m,f} \right)}{{\hat{N}}_{2}^{*}\left( {m,f} \right)}}}} & (31) \\{{C_{P,{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{P\left( {m,f} \right)}{{\hat{N}}_{2}^{*}\left( {m,f} \right)}}}} & (32)\end{matrix}$

C_(N̂₂,N̂₂*)(f), expressed in Eq. (31), is given by the sum of products of the “cleaner” background noise component N̂₂(m, f) and its own complex conjugate at a given frequency bin f for some number of frames (i.e., the power spectrum of the “cleaner” background noise component N̂₂(m, f)) and can be referred to as the background noise statistics. C_(P,N̂₂*)(f), expressed in Eq. (32), is given by the sum of the products of the primary input speech signal P(m, f) and the complex conjugate of the “cleaner” background noise component N̂₂(m, f) at a given frequency bin f for some number of frames (i.e., the cross-spectrum at that frequency bin between the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f)) and can be referred to as the cross-channel background noise statistics.

To accommodate the time varying nature of C_(N̂₂,N̂₂*)(f) and C_(P,N̂₂*)(f) expressed in Eq. (31) and Eq. (32), these statistics can be estimated using a time window (as is done in Eq. (31) and Eq. (32)) or using a moving average. The calculation of the statistics using a moving average can be expressed as:

$\begin{matrix}{{C_{P,{\hat{N}}_{2}^{*}}(m,f)} = {{\gamma(m)} \cdot {C_{P,{\hat{N}}_{2}^{*}}(m-1,f)}} + {{\left( {1 - \gamma(m)} \right)} \cdot {P(m,f)}{{\hat{N}}_{2}^{*}(m,f)}}} & (61) \\{{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}(m,f)} = {{\gamma(m)} \cdot {C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}(m-1,f)}} + {{\left( {1 - \gamma(m)} \right)} \cdot {{\hat{N}}_{2}(m,f)}{{\hat{N}}_{2}^{*}(m,f)}}} & (62)\end{matrix}$

where ( )* indicates complex conjugate, m indexes the time or frame, f indexes a particular frequency component or sub-band, and γ(m) is an adaptation factor.

It should be noted that the moving averages expressed in Eq. (61) and Eq. (62), commonly referred to as exponential moving averages or exponentially weighted moving averages, are provided for exemplary purposes only and are not intended to be limiting. Persons skilled in the relevant art(s) will recognize that other moving average expressions can be used.

If BM 305 is operating well and providing the “cleaner” background noise component N̂₂(m, f) with little or no residual amount of the desired speech component S₂(m, f), then the adaptation factor γ(m) can be set to a constant. However, if BM 305 is not operating perfectly and a residual amount of the desired speech component S₂(m, f) is left in the “cleaner” background noise component N̂₂(m, f), setting the adaptation factor γ(m) to a constant can result in distortion or cancellation of the desired speech. Therefore, the adaptation factor γ(m) can be varied over time according to the likelihood of desired speech being present, and the updating of the statistics expressed in Eq. (61) and in Eq. (62) can be effectively halted when the likelihood of desired speech being present is high.

For the statistics used to derive (or update) blocking matrix filter 315, the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) was used as an indicator of speech presence and as an input parameter to determine the adaptation factor α(m). In a similar manner, the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) can be used as an indicator of speech presence and as an input parameter to determine the adaptation factor γ(m). However, given that BM 305 removed desired speech from the reference input speech signal R(m, f) (at least partially) to produce the “cleaner” background noise component N̂₂(m, f), the difference in energy, or a moving average of the difference in energy, between a current frame of the primary input speech signal P(m, f) and a current frame of the “cleaner” background noise component N̂₂(m, f) can alternatively be used as an indicator of speech presence and as an input parameter to determine the adaptation factor γ(m). In fact, using the “cleaner” background noise component N̂₂(m, f) as opposed to the reference input speech signal R(m, f) can provide better discrimination, assuming BM 305 is functioning well.

As mentioned above, the statistics expressed in Eq. (61) and Eq. (62) for adaptive noise canceler filter 325 represent statistics of the background noise. Thus, the rate at which the statistics are updated will affect the ability of the overall noise suppression system to track and suppress moving background noise sources, e.g. a talking person walking by, a moving vehicle driving by, etc. Updating the statistics expressed in Eq. (61) and Eq. (62) at a fast pace will allow good tracking and suppression of moving noise sources. On the other hand, a fast update pace can potentially degrade steady-state suppression of stationary background noise sources. Therefore, a method referred to as dual adaptive noise cancelation can be used, where one set of statistics is maintained and updated at a fast rate (favoring moving noise sources) and another set of statistics is maintained and updated at a slow rate (favoring steady-state performance). Prior to applying adaptive noise canceler filter 325, one of the two sets of statistics is selected and used to configure the filter.

For example, the following two sets of the statistics expressed in Eq. (61) and Eq. (62) can be maintained:

$\begin{matrix}{{C_{P,{\hat{N}}_{2}^{*}}^{fast}(m,f)} = {{\gamma_{fast}(m)} \cdot {C_{P,{\hat{N}}_{2}^{*}}^{fast}(m-1,f)}} + {{\left( {1 - \gamma_{fast}(m)} \right)} \cdot {P(m,f)}{{\hat{N}}_{2}^{*}(m,f)}}} & (63) \\{{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}^{fast}(m,f)} = {{\gamma_{fast}(m)} \cdot {C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}^{fast}(m-1,f)}} + {{\left( {1 - \gamma_{fast}(m)} \right)} \cdot {{\hat{N}}_{2}(m,f)}{{\hat{N}}_{2}^{*}(m,f)}}} & (64)\end{matrix}$

and

$\begin{matrix}{{C_{P,{\hat{N}}_{2}^{*}}^{slow}(m,f)} = {{\gamma_{slow}(m)} \cdot {C_{P,{\hat{N}}_{2}^{*}}^{slow}(m-1,f)}} + {{\left( {1 - \gamma_{slow}(m)} \right)} \cdot {P(m,f)}{{\hat{N}}_{2}^{*}(m,f)}}} & (65) \\{{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}^{slow}(m,f)} = {{\gamma_{slow}(m)} \cdot {C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}^{slow}(m-1,f)}} + {{\left( {1 - \gamma_{slow}(m)} \right)} \cdot {{\hat{N}}_{2}(m,f)}{{\hat{N}}_{2}^{*}(m,f)}}} & (66)\end{matrix}$

where Eq. (63) and Eq. (64) represent the set of statistics updated at a fast rate (hence, the use of the fast adaptation factor γ_(fast)(m)), and Eq. (65) and Eq. (66) represent the set of statistics updated at a slow rate (hence, the use of the slow adaptation factor γ_(slow)(m)).

As discussed above, the adaptation factors γ_(fast)(m) and γ_(slow)(m) can be determined, for example, based on the difference in energy, or a moving average of the difference in energy, between a current frame of the primary input speech signal P(m, f) and a current frame of the “cleaner” background noise component N̂₂(m, f). FIG. 8 illustrates example piecewise linear mappings 805 and 810 that can be used to map this difference in energy (or moving average of the difference in energy) to the adaptation factor γ(m). More specifically, piecewise linear mapping 805 provides a mapping from this difference in energy to the fast adaptation factor γ_(fast)(m). Piecewise linear mapping 810, on the other hand, provides a mapping from this difference in energy to the slow adaptation factor γ_(slow)(m).

In general, both mappings set the adaptation factor γ(m) to a large value (e.g., a value of one) if the difference in energy (or moving average of the difference in energy) between a current frame of the primary input speech signal P(m, f) and a current frame of the “cleaner” background noise component N̂₂(m, f) is greater than a certain predetermined value (indicating a strong likelihood of desired speech dominating background noise), and to a smaller value greater than zero and smaller than one if that difference is less than a certain predetermined value (indicating a strong likelihood of background noise dominating desired speech), while a piecewise linear mapping can be used in-between the two predetermined values.

Using a mapping as described above generally means that the statistics expressed in Eq. (63), Eq. (64), Eq. (65), and Eq. (66) will be updated at a rate inversely related to the difference in energy (or moving average of the difference in energy) between the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f).

Prior to applying adaptive noise canceler filter 325, one of the two sets of statistics needs to be selected for calculating its transfer function. In at least one embodiment, the set of statistics (i.e., either the fast or slow version) that results in adaptive noise canceler filter 325 producing an output signal with the least amount of power is selected. The output power of adaptive noise canceler filter 325 using each set of statistics can be expressed as:

$\begin{matrix}{E_{fast} = {\sum\limits_{f}\left| {{P(m,f)} - {{W^{fast}(f)}{{\hat{N}}_{2}(m,f)}}} \right|^{2}}} & (67) \\{E_{slow} = {\sum\limits_{f}\left| {{P(m,f)} - {{W^{slow}(f)}{{\hat{N}}_{2}(m,f)}}} \right|^{2}}} & (68)\end{matrix}$

where

$\begin{matrix}{{W^{fast}(f)} = \frac{C_{P,{\hat{N}}_{2}^{*}}^{fast}(f)}{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}^{fast}(f)}} & (69) \\{{W^{slow}(f)} = \frac{{CP}_{P,{\hat{N}}_{2}^{*}}^{slow}(f)}{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}^{slow}(f)}} & (70)\end{matrix}$

Hence, the final adaptive noise canceler filter 325 is selected according to:

$\begin{matrix}{{W(f)} = \left\{ \begin{matrix}{W^{fast}(f)} & {E_{fast} < E_{slow}} \\{W^{slow}(f)} & {otherwise}\end{matrix} \right.} & (71)\end{matrix}$

FIG. 9 depicts a flowchart 900 of a method for estimating the time-varying statistics of adaptive noise canceler filter 325, illustrated in FIG. 3, in accordance with an embodiment of the present invention. The method of flowchart 900 can be performed, for example and without limitation, by statistics estimator 345 as described above in reference to FIG. 3. However, the method is not limited to that implementation.

As shown in FIG. 9, the method of flowchart 900 begins at step 905 and immediately transitions to step 910. At step 910, a current frame of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f) are received.

At step 915, a difference in energy between the current frame of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f) is calculated. Alternatively, a moving average of the difference in energy between the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f) is updated based on the current frame of each signal.

At step 920, the adaptation factors γ_(slow)(m) and γ_(fast)(m) are determined based on at least the difference in energy between the current frames of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f) calculated at step 915. For example, the adaptation factors γ_(fast)(m) and γ_(slow)(m) can be respectively determined based on piecewise linear mappings 805 and 810 illustrated in FIG. 8, although other mappings can be used to determine the adaptation factors. Alternatively, the adaptation factors γ_(slow)(m) and γ_(fast)(m) are determined based on at least the moving average of the difference in energy between the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f). It should be noted that information other than the difference in energy calculated at step 915 or the moving average of the difference in energy can be used to determine the adaptation factors γ_(slow)(m) and γ_(fast)(m). For example, a voice activity indicator provided by a voice activity detector (not shown) can be used in combination with either the difference in energy calculated at step 915 or the moving average of the difference in energy to determine the adaptation factors γ_(slow)(m) and γ_(fast)(m).

At step 925, the statistics used to determine adaptive noise canceler filter 325 are updated based on the previous values of the statistics, the current frame of the primary input speech signal P(m, f) and the “cleaner” background noise component N̂₂(m, f), and the adaptation factors γ_(slow)(m) and γ_(fast)(m). For example, the statistics can be updated according to Eq. (63), Eq. (64), Eq. (65), and Eq. (66) above.

3.3 Automatic Microphone Calibration

Automatic microphone calibration can be further included in multi-channel noise suppression system 300 illustrated in FIG. 3 to estimate, for example, variations in the sensitivity of primary speech microphone 104 and noise reference microphone 106. This is an important function since the sensitivity of each microphone can vary, for example, by as much as ±3 dB, resulting in a maximum mismatch between the two microphones of ±6 dB. Such a large variation can undermine the effectiveness of using the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) to determine the adaptation factors and time-varying statistics as discussed above in the preceding sub-sections. In performing automatic microphone calibration, it is important to only capture differences in the sensitivity of primary speech microphone 104 and noise reference microphone 106 due to production variations and/or aging, and not due to other factors, such as the direction or distance of a background noise source, shielding of one or both microphones (e.g., by a finger or hair), etc.

FIG. 10 illustrates an exemplary variation 1000 of multi-channel noise suppression system 300 that further implements an automatic microphone calibration scheme in accordance with an embodiment of the present invention. More specifically, multi-channel noise suppression system 1000 further includes a microphone mismatch estimator 1005 for estimating a difference in sensitivity between primary speech microphone 104 and noise reference microphone 106, and a microphone mismatch compensator 1010 to compensate for this estimated difference.

More specifically, microphone mismatch estimator 1005 determines and updates a current estimate of the difference in sensitivity between primary speech microphone 104 and noise reference microphone 106 by exploiting the knowledge that in diffuse sound fields (or when the device is far-field relative to a source) the energy of the signals received by primary speech microphone 104 and noise reference microphone 106 should be approximately equal, as well as the fact that aging of the two microphones is a slow process. Therefore, determining when the two microphones are in a diffuse sound field should provide a robust method for updating a current estimate of the difference in sensitivity between the two microphones. The identification of a diffuse sound field can be carried out in several different ways.

For example, one potential method for determining if the two microphones are in a diffuse sound field is to fix the phase according to a specific direction, calculate the corresponding optimal gain for maximum prediction of the signal received by noise reference microphone 106 from the signal received by primary speech microphone 104, and measure the prediction gain. By carrying these steps out for a variety of phases corresponding to a variety of directions, and comparing the prediction gains in different directions, it is possible to determine if sound is coming from multiple directions (indicating a diffuse sound field) or from a well-defined direction.

An alternative or supporting method is to assume a diffuse sound field when the energies of the signals received by both microphones are within some range of their respective minimum levels (representing the acoustic noise floor on each microphone). The lowest level is generally a result of diffuse environmental ambient noise (as long as it is above the noise floor of non-acoustic noise sources), and hence is suitable for updating a current estimate of the difference in sensitivity between primary speech microphone 104 and noise reference microphone 106.
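A minimal sketch of such minimum-level tracking, under assumed drift and margin constants, might look as follows.

```python
import numpy as np

class MinLevelTracker:
    """Tracks an approximate per-microphone minimum frame level (dB) as a
    stand-in for the acoustic noise floor.  The slow upward drift lets the
    minimum recover when conditions change; the rates are assumptions."""

    def __init__(self, init_db=-60.0, rise_db_per_frame=0.02):
        self.min_db = init_db
        self.rise = rise_db_per_frame

    def update(self, frame_db):
        if frame_db < self.min_db:
            self.min_db = frame_db      # follow new minima immediately
        else:
            self.min_db += self.rise    # otherwise drift up slowly
        return self.min_db

def near_noise_floor(frame_db_p, frame_db_r, floor_p, floor_r, margin_db=6.0):
    """Flag frames where both microphones sit within a margin of their own
    minimum levels, suggesting diffuse ambient noise on both channels."""
    return (frame_db_p - floor_p < margin_db) and (frame_db_r - floor_r < margin_db)
```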

Additionally, updating of the sensitivity mismatch generally should be avoided when circuit noise, such as thermal noise, dominates. Such noise is picked up after the microphones, electronically rather than acoustically, and consequently is not reflective of the sensitivity of the microphones. Because thermal noise is generally incoherent between the signal paths of the two microphones, it can be mistaken for a diffuse sound field suitable for tracking the sensitivity mismatch. To prevent updating when such noise dominates, an absolute lower level can be established under which no updating or tracking is performed. Other non-acoustic noise sources that should be excluded from tracking of the microphone sensitivity mismatch include wind noise.

Moreover, the expected range of microphone sensitivity mismatch can generally be determined from specifications provided by the microphone manufacturer. Therefore, as a safeguard against divergence of the sensitivity mismatch estimation, the sensitivity mismatch can be updated only if the observed mismatch (without sensitivity mismatch compensation) is below the sum of the microphone production tolerances plus a suitable bias term. The bias term ensures that the estimated microphone sensitivity mismatch can span the entire expected variation.

After determining a suitable time to update the sensitivity mismatch using, for example, one or more of the methods discussed above, microphone mismatch estimator 1005 actually updates the current estimated value of the sensitivity mismatch. Microphone mismatch estimator 1005 can update this value based on the difference in energy between a current frame of the primary input speech signal P(m, f) and a current frame of the reference input speech signal R(m, f) during the suitable time, for example, in accordance with the following moving average expression:

$M^{cal}(m)=\beta_{cal}\cdot M^{cal}(m-1)+\left(1-\beta_{cal}\right)\cdot M_{diff}(m)$  (72)

where M^{cal}(m) is the current estimated value of the acoustic sensitivity mismatch, M^{cal}(m−1) is the previous estimated value of the acoustic sensitivity mismatch, M_diff(m) is the difference in energy between the current frame of the primary input speech signal P(m, f) and the current frame of the reference input speech signal R(m, f) calculated during the suitable time, and β_cal is a smoothing factor. In at least one example, the difference in energy can be calculated by subtracting the log-energy of the current frame of the reference input speech signal from the log-energy of the current frame of the primary input speech signal.

In general, the objective of automatic microphone calibration is to track long term changes and variation in acoustic sensitivity. Therefore, a value close to (but smaller than) one can be used for the smoothing factor β_cal to introduce long term averaging. However, a value close to one will also result in slow initial convergence, so it may be advantageous to vary the smoothing factor β_cal such that it has a smaller value immediately following a reset of the current estimated value of the sensitivity mismatch M^{cal}(m) and gradually increases toward a value close to one as updates are performed.
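The sketch below combines the Eq. (72) moving average with the gating conditions discussed above (diffuse-field requirement, absolute level floor for circuit noise, and the tolerance-plus-bias safeguard) and with a smoothing factor that ramps up after a reset; the class structure and the specific constants are illustrative assumptions.

```python
class MicMismatchEstimator:
    """Moving-average estimate of the microphone sensitivity mismatch per
    Eq. (72): M_cal(m) = beta * M_cal(m-1) + (1 - beta) * M_diff(m)."""

    def __init__(self, beta_final=0.999, beta_init=0.9, ramp=0.01,
                 tolerance_db=6.0, bias_db=2.0, abs_floor_db=-70.0):
        self.m_cal = 0.0                 # current mismatch estimate (dB)
        self.beta = beta_init            # ramped smoothing factor
        self.beta_final = beta_final
        self.ramp = ramp
        self.max_mismatch = tolerance_db + bias_db  # safeguard bound
        self.abs_floor_db = abs_floor_db

    def update(self, m_diff_db, frame_db, diffuse):
        # Skip updating when the field is not diffuse, when the level is so
        # low that circuit noise dominates, or when the observed mismatch
        # falls outside the production tolerances plus the bias term.
        if (not diffuse or frame_db < self.abs_floor_db
                or abs(m_diff_db) > self.max_mismatch):
            return self.m_cal
        self.m_cal = self.beta * self.m_cal + (1.0 - self.beta) * m_diff_db
        # Ramp beta toward its final value for fast initial convergence.
        self.beta = min(self.beta + self.ramp, self.beta_final)
        return self.m_cal
```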

The current estimated value of the sensitivity mismatch M^{cal}(m) is passed on to microphone mismatch compensator 1010, which uses it to scale reference input speech signal R(m, f) to compensate for any mismatch. The scaled version of reference input speech signal R(m, f) is denoted in FIG. 10 by the signal R̂(m, f). It should be noted, however, that the reference input speech signal R(m, f) is chosen to be scaled in multi-channel noise suppression system 1000 for illustrative purposes only and is not intended to be limiting. Persons skilled in the relevant art(s) will recognize that the primary input speech signal P(m, f) can instead be scaled to compensate for the estimated difference, or both the primary input speech signal P(m, f) and the reference input speech signal R(m, f) can be scaled.

In another embodiment, rather than scaling the primary input speech signal P(m, f) and/or the reference input speech signal R(m, f) based on the current estimated value of the sensitivity mismatch M^{cal}(m), that value can be used as an additional input to control the update of the time-varying statistics as described in the preceding sub-sections.

FIG. 11 depicts a flowchart 1100 of a method for updating the current estimated value of the sensitivity mismatch in accordance with an embodiment of the present invention. The method of flowchart 1100 can be performed, for example and without limitation, by microphone mismatch estimator 1005 as described above in reference to FIG. 10. However, the method is not limited to that implementation.

As shown in FIG. 11, the method of flowchart 1100 begins at step 1105 and immediately transitions to step 1110. At step 1110, a current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) are received.

At step 1115, the presence of a diffuse sound field is identified (at least in part) based on the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) using, for example, one or more of the methods described above in regard to FIG. 10.

At step 1120, a difference in energy between the current frame of the primary input speech signal P(m, f) and the reference input speech signal R(m, f) is calculated.

At step 1125, if the presence of a diffuse sound field was identified at step 1115, the current estimated value of the sensitivity mismatch is updated based on the previous estimated value of the sensitivity mismatch and the difference in energy calculated at step 1120. For example, the current estimated value of the sensitivity mismatch can be updated according to Eq. (72) above.

Instead of carrying out microphone mismatch estimation and compensation as detailed above, it is possible to track the (diffuse) noise levels on the two microphones and then, instead of using the level difference between the two microphones to control the estimation of statistics, use the level difference normalized by the respective (diffuse) noise levels. This amounts to using the SNR difference between the two microphones, rather than the level difference, to control the estimation of statistics. Hence, wherever a level difference is referred to as an input for controlling the update of statistics, it should be understood that a corresponding SNR difference can be used as an alternative, thereby effectively carrying out microphone mismatch compensation implicitly.

4. Variations

4.1 Frequency Dependent Adaptation Factor

As can be seen in section 3 above, the estimation of the time-varying statistics used to derive (or update) blocking matrix filter 315 and adaptive noise canceler filter 325 can be controlled by the full-band energy difference of various signals (e.g., the full-band energy difference of primary input speech signal P(m, f) and reference input speech signal R(m, f)). However, improved performance can be expected by allowing the update control of the time-varying statistics to have some frequency resolution.

For example, the update control can be based on frequency dependent energy differences. More specifically, the adaptation factors (which are used as an update control) can become frequency dependent according to the mapping from the frequency dependent energy differences to adaptation factors. The advantage of this can be seen intuitively from a simple example. Assume that desired speech only has content below 1500 Hz and background noise only has content above 2000 Hz. With the full-band energy difference, the algorithm will try to come up with a full-band likelihood of desired speech presence, and this likelihood will depend on the relative energies of the desired speech and background noise. On the other hand, if frequency dependent update control is implemented, then updates can be performed with the likelihood of desired signal presence being one below 1500 Hz and zero above 2000 Hz, and both the speech statistics for blocking matrix filter 315 and the noise statistics for adaptive noise canceler filter 325 can be updated more optimally.
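A minimal sketch of a frequency dependent adaptation factor, assuming the full-band mapping shape is simply reapplied per frequency bin, is given below; the breakpoints and γ range are again illustrative.

```python
import numpy as np

def per_bin_gamma(P, R, gamma_min=0.9, gamma_max=0.9999, d_lo=0.0, d_hi=12.0):
    """Frequency dependent adaptation factors: the full-band mapping is
    applied to the per-bin level difference instead.  P and R are complex
    spectra of the current frame (1-D arrays, one entry per bin)."""
    diff_db = (10.0 * np.log10(np.abs(P) ** 2 + 1e-12)
               - 10.0 * np.log10(np.abs(R) ** 2 + 1e-12))
    t = np.clip((diff_db - d_lo) / (d_hi - d_lo), 0.0, 1.0)
    return gamma_min + t * (gamma_max - gamma_min)  # one gamma per bin
```

In the 1500 Hz / 2000 Hz example above, the bins below 1500 Hz would map to γ near one (slow update of noise statistics) while the bins above 2000 Hz would map to smaller γ (fast update).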

4.2 Switched Blocking Matrix and Adaptive Noise Canceler

When desired speech is absent in the primary input speech signal P(m, f), the speech statistics for blocking matrix filter 315 generally are not updated and the filter remains unchanged. This means that the “cleaner” background noise component N̂₂(m, f), produced (in part) by blocking matrix filter 315 during desired speech absence, will include not only the background noise component N₂(m, f) of the reference input speech signal R(m, f), but also an additive filtered component of the primary input speech signal P(m, f), which at that time contains only background noise and no desired speech. This additive filtered component can complicate the task of adaptive noise canceler filter 325 to the point where the filter provides significantly less noise suppression than if blocking matrix filter 315 were disabled during desired speech absence. Therefore, it can be advantageous to operate a switched structure, in which blocking matrix filter 315 is disabled during desired speech absence.

To accommodate such a switched structure, multiple copies of the time-varying statistics used to derive (or update) adaptive noise canceler filter 325 can be maintained: one copy for use when blocking matrix filter 315 is enabled, and another copy for use when blocking matrix filter 315 is disabled.

4.2.1 Scaled Blocking Matrix

In practice it may be advantageous to use a switching mechanism that turns blocking matrix filter 315 partially on and partially off based on the likelihood of desired speech being present in the primary input speech signal P(m, f), rather than a hard switching mechanism that simply turns blocking matrix filter 315 either completely on or completely off. For example, such a soft switching mechanism can be implemented as a scaling of the coefficients of blocking matrix filter 315 by a scaling factor having a value between zero and one that is adjusted based on the likelihood of desired speech being present in the primary input speech signal P(m, f). A good estimate of this likelihood can be calculated from the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f).

Furthermore, it can be advantageous to make the scaling factor frequency dependent, as the desired speech source may occupy or dominate certain frequency ranges while a background noise source may occupy or dominate different frequency ranges. Frequency dependency can be achieved by calculating the difference in energy between the primary input speech signal P(m, f) and the reference input speech signal R(m, f) not on a full-band basis, but rather on individual frequency bins or groups of frequency bins.

The frequency dependent level difference can be calculated as:

$M_{frq}(m,f)=\beta_{r1}\cdot M_{frq}(m-1,f)+\left(1-\beta_{r1}\right)\cdot\left(10\cdot\log_{10}\left(P(m,f)P^{*}(m,f)\right)-10\cdot\log_{10}\left(R(m,f)R^{*}(m,f)\right)\right)$  (73)

where P(m, f) and R(m, f) have already been subject to the microphone mismatch compensation. The scaled taps of blocking matrix filter 315 are calculated according to:

$\begin{matrix}{{H\left( {m,f} \right)} = \left\{ \begin{matrix}0 & {{M_{frq}\left( {m,f} \right)} < T_{off}} \\{\frac{{M_{frq}\left( {m,f} \right)} - T_{off}}{T_{on} - T_{off}} \cdot \frac{C_{R,P^{*}}\left( {m,f} \right)}{C_{P,P^{*}}\left( {m,f} \right)}} & {T_{off} \leq {M_{frq}\left( {m,f} \right)} \leq T_{on}} \\\frac{C_{R,P^{*}}\left( {m,f} \right)}{C_{P,P^{*}}\left( {m,f} \right)} & {{M_{frq}\left( {m,f} \right)} > T_{on}}\end{matrix} \right.} & (74)\end{matrix}$

Hence, the scaled filter equals the regular blocking matrix filter 315 during certain desired speech presence (large microphone level difference at the specific frequency bin), is completely off during certain desired speech absence, and assumes a scaled version according to the microphone level difference at the specific frequency bin during uncertainty of desired speech presence. Example values of the parameters are T_off = 3 dB and T_on = 8 dB.
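The following Python sketch mirrors the smoothed per-bin level difference of Eq. (73) and the soft-scaled taps of Eq. (74); the smoothing constant and the small regularization terms are assumptions.

```python
import numpy as np

def update_level_diff(M_prev, P, R, beta=0.9):
    """One step of the smoothed per-bin level difference of Eq. (73);
    P and R are mismatch-compensated complex spectra of the current frame."""
    inst = (10.0 * np.log10(np.abs(P) ** 2 + 1e-12)
            - 10.0 * np.log10(np.abs(R) ** 2 + 1e-12))
    return beta * M_prev + (1.0 - beta) * inst

def scaled_blocking_matrix(M_frq, C_RP, C_PP, t_off=3.0, t_on=8.0):
    """Soft-switched blocking matrix taps per Eq. (74): fully on above t_on,
    fully off below t_off, and linearly scaled in between.  C_RP is the
    cross-statistic C_{R,P*} and C_PP the auto-statistic C_{P,P*}, per bin."""
    h_full = C_RP / (C_PP + 1e-12)                    # regular BM taps
    scale = np.clip((M_frq - t_off) / (t_on - t_off), 0.0, 1.0)
    return scale * h_full
```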

4.2.2 Adaptive Noise Canceler as a Function of the Blocking Matrix

A complication of having a soft decision in the form of the blocking matrix scaling, rather than a hard on-off switch, is the inability to simply maintain two sets of statistics for the ANC section (one corresponding to the blocking matrix on, and a second to the blocking matrix off). The scaling of the blocking matrix introduces a source of modulation into the output signal of the blocking matrix, on which the statistics for the ANC section are based, which could further complicate the tracking of the ANC statistics. To address this, the solution for the ANC section is further analyzed. The analysis is based on the single complex tap, but can be applied to any of the formulations. From sub-section 2.2:

$\begin{matrix}\begin{matrix}{{C_{P,{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{P\left( {m,f} \right){\hat{N}}_{2}^{*}\left( {m,f} \right)}}} \\ {= {\sum\limits_{m}{P\left( {m,f} \right)\left( {R\left( {m,f} \right) - {H(f)}P\left( {m,f} \right)} \right)^{*}}}} \\ {= {{\sum\limits_{m}{P\left( {m,f} \right)R^{*}\left( {m,f} \right)}} - {{H(f)}{\sum\limits_{m}{P\left( {m,f} \right)P^{*}\left( {m,f} \right)}}}}} \\ {= {{C_{P,R^{*}}(f)} - {{H(f)}{C_{P,P^{*}}(f)}}}}\end{matrix} & (75) \\ \begin{matrix}{{C_{{\hat{N}}_{2},{\hat{N}}_{2}^{*}}(f)} = {\sum\limits_{m}{{\hat{N}}_{2}\left( {m,f} \right){\hat{N}}_{2}^{*}\left( {m,f} \right)}}} \\ {= {\sum\limits_{m}{\left( {R\left( {m,f} \right) - {H(f)}P\left( {m,f} \right)} \right)\left( {R\left( {m,f} \right) - {H(f)}P\left( {m,f} \right)} \right)^{*}}}} \\ {= {{\sum\limits_{m}{R\left( {m,f} \right)R^{*}\left( {m,f} \right)}} + {{H(f)}{H^{*}(f)}{\sum\limits_{m}{P\left( {m,f} \right)P^{*}\left( {m,f} \right)}}} - {2\,{Re}\left\{ {{H(f)}{\sum\limits_{m}{P\left( {m,f} \right)R^{*}\left( {m,f} \right)}}} \right\}}}} \\ {= {{C_{R,R^{*}}(f)} + {{H(f)}{H^{*}(f)}{C_{P,P^{*}}(f)}} - {2\,{Re}\left\{ {{H(f)}{C_{P,R^{*}}(f)}} \right\}}}}\end{matrix} & (76)\end{matrix}$

As opposed to sub-section 3.2 above, where the noise components of C_{P,N̂₂*}(f) and C_{N̂₂,N̂₂*}(f) were tracked and estimated, the present solution requires tracking of the noise components of C_{P,R*}(f), C_{P,P*}(f), and C_{R,R*}(f). From these estimates and the instantaneous (scaled) blocking matrix filter 315, the estimates of C_{P,N̂₂*}(f) and C_{N̂₂,N̂₂*}(f) can be calculated according to the above two equations, providing the necessary statistics to calculate the filter taps of adaptive noise canceler filter 325 according to sub-section 3.2. The necessary estimates of the statistics C_{P,R*}(f), C_{P,P*}(f), and C_{R,R*}(f) are calculated equivalently to the estimates of the statistics C_{P,N̂₂*}(f) and C_{N̂₂,N̂₂*}(f) in sub-section 3.2, and both a fast tracking and a slow tracking version of these statistics can be used:

$C_{P,R^{*}}^{fast}(m,f)=\gamma_{fast}(m,f)\cdot C_{P,R^{*}}^{fast}(m-1,f)+\left(1-\gamma_{fast}(m,f)\right)\cdot P(m,f)R^{*}(m,f)$  (77)

$C_{P,P^{*}}^{fast}(m,f)=\gamma_{fast}(m,f)\cdot C_{P,P^{*}}^{fast}(m-1,f)+\left(1-\gamma_{fast}(m,f)\right)\cdot P(m,f)P^{*}(m,f)$  (78)

$C_{R,R^{*}}^{fast}(m,f)=\gamma_{fast}(m,f)\cdot C_{R,R^{*}}^{fast}(m-1,f)+\left(1-\gamma_{fast}(m,f)\right)\cdot R(m,f)R^{*}(m,f)$  (79)

and

$C_{P,R^{*}}^{slow}(m,f)=\gamma_{slow}(m,f)\cdot C_{P,R^{*}}^{slow}(m-1,f)+\left(1-\gamma_{slow}(m,f)\right)\cdot P(m,f)R^{*}(m,f)$  (80)

$C_{P,P^{*}}^{slow}(m,f)=\gamma_{slow}(m,f)\cdot C_{P,P^{*}}^{slow}(m-1,f)+\left(1-\gamma_{slow}(m,f)\right)\cdot P(m,f)P^{*}(m,f)$  (81)

$C_{R,R^{*}}^{slow}(m,f)=\gamma_{slow}(m,f)\cdot C_{R,R^{*}}^{slow}(m-1,f)+\left(1-\gamma_{slow}(m,f)\right)\cdot R(m,f)R^{*}(m,f)$  (82)

Additionally, as indicated by the above equations, the fast and slow adaptation factors γ_fast(m, f) and γ_slow(m, f) can be made frequency dependent by mapping the level difference on a frequency bin basis. The mapping can be identical to that of sub-section 3.2, except for being frequency bin based instead of full-band based.
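As an illustrative sketch, the fragment below tracks one set of the base statistics of Eqs. (77)-(82) and then derives the ANC statistics of Eqs. (75) and (76) from them and the instantaneous scaled taps H; the dict-based state and the per-bin γ array are assumptions.

```python
import numpy as np

def track_base_stats(stats, P, R, gamma):
    """One recursion of Eqs. (77)-(82) for either the fast or the slow set;
    `stats` is a dict with keys 'PR', 'PP', 'RR' holding per-bin arrays,
    and `gamma` may be a scalar or a per-bin array."""
    stats['PR'] = gamma * stats['PR'] + (1.0 - gamma) * P * np.conj(R)
    stats['PP'] = gamma * stats['PP'] + (1.0 - gamma) * P * np.conj(P)
    stats['RR'] = gamma * stats['RR'] + (1.0 - gamma) * R * np.conj(R)
    return stats

def anc_stats_from_base(C_PR, C_PP, C_RR, H):
    """Derive the ANC statistics from the tracked base statistics and the
    instantaneous (scaled) blocking matrix taps, per Eqs. (75)-(76)."""
    C_P_N2 = C_PR - H * C_PP                               # Eq. (75)
    C_N2_N2 = (C_RR + H * np.conj(H) * C_PP
               - 2.0 * np.real(H * C_PR))                  # Eq. (76)
    return C_P_N2, C_N2_N2
```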

Yet a further refinement is to select taps from the fast and slow tracking ANCs on a frequency bin basis, instead of a full-band basis as in sub-section 3.2:

$E_{fast}(m,f)=\left|P(m,f)-W^{fast}(m,f){\hat{N}}_{2}(m,f)\right|^{2}$  (83)

$E_{slow}(m,f)=\left|P(m,f)-W^{slow}(m,f){\hat{N}}_{2}(m,f)\right|^{2}$  (84)

where:

$\begin{matrix}{{W^{fast}\left( {m,f} \right)} = \frac{{C_{P,R^{*}}^{fast}\left( {m,f} \right)} - {{H(f)}{C_{P,P^{*}}^{fast}\left( {m,f} \right)}}}{\begin{matrix}{{C_{R,R^{*}}^{fast}\left( {m,f} \right)} + {{H(f)}{H^{*}(f)}{C_{P,P^{*}}^{fast}\left( {m,f} \right)}} -} \\{2{Re}\left\{ {{H(f)}{C_{P,R^{*}}^{fast}\left( {m,f} \right)}} \right\}}\end{matrix}}} & (85) \\{{W^{slow}\left( {m,f} \right)} = \frac{{C_{P,R^{*}}^{slow}\left( {m,f} \right)} - {{H(f)}{C_{P,P^{*}}^{slow}\left( {m,f} \right)}}}{\begin{matrix}{{C_{R,R^{*}}^{slow}\left( {m,f} \right)} + {{H(f)}{H^{*}(f)}{C_{P,P^{*}}^{slow}\left( {m,f} \right)}} -} \\{2{Re}\left\{ {{H(f)}{C_{P,R^{*}}^{slow}\left( {m,f} \right)}} \right\}}\end{matrix}}} & (86)\end{matrix}$

Hence, the final adaptive noise canceler filter 325 is selected according to:

$\begin{matrix}{{W\left( {m,f} \right)} = \left\{ \begin{matrix}{W^{fast}\left( {m,f} \right)} & {{E_{fast}\left( {m,f} \right)} < {E_{slow}\left( {m,f} \right)}} \\{W^{slow}\left( {m,f} \right)} & {otherwise}\end{matrix} \right.} & (87)\end{matrix}$

5. Example Computer System Implementation

It will be apparent to persons skilled in the relevant art(s) that various elements and features of the present invention, as described herein, can be implemented in hardware using analog and/or digital circuits, in software through the execution of instructions by one or more general purpose or special purpose processors, or as a combination of hardware and software.

The following description of a general purpose computer system is provided for the sake of completeness. Embodiments of the present invention can be implemented in hardware, or as a combination of software and hardware. Consequently, embodiments of the invention may be implemented in the environment of a computer system or other processing system. An example of such a computer system 1200 is shown in FIG. 12. The modules depicted in FIGS. 3 and 10, for example, can execute on one or more distinct computer systems 1200. Furthermore, each of the steps of the flowcharts depicted in FIGS. 5, 7, 9, and 11 can be implemented on one or more distinct computer systems 1200.

Computer system 1200 includes one or more processors, such as processor 1204. Processor 1204 can be a special purpose or a general purpose digital signal processor. Processor 1204 is connected to a communication infrastructure 1202 (for example, a bus or network). Various software implementations are described in terms of this exemplary computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement the invention using other computer systems and/or computer architectures.

Computer system 1200 also includes a main memory 1206, preferably random access memory (RAM), and may also include a secondary memory 1208. Secondary memory 1208 may include, for example, a hard disk drive 1210 and/or a removable storage drive 1212, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, or the like. Removable storage drive 1212 reads from and/or writes to a removable storage unit 1216 in a well-known manner. Removable storage unit 1216 represents a floppy disk, magnetic tape, optical disk, or the like, which is read by and written to by removable storage drive 1212. As will be appreciated by persons skilled in the relevant art(s), removable storage unit 1216 includes a computer usable storage medium having stored therein computer software and/or data.

In alternative implementations, secondary memory 1208 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 1200. Such means may include, for example, a removable storage unit 1218 and an interface 1214. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a thumb drive and USB port, and other removable storage units 1218 and interfaces 1214 which allow software and data to be transferred from removable storage unit 1218 to computer system 1200.

Computer system 1200 may also include a communications interface 1220. Communications interface 1220 allows software and data to be transferred between computer system 1200 and external devices. Examples of communications interface 1220 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 1220 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 1220. These signals are provided to communications interface 1220 via a communications path 1222. Communications path 1222 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, and other communications channels.

As used herein, the terms “computer program medium” and “computer readable medium” are used to generally refer to tangible storage media such as removable storage units 1216 and 1218 or a hard disk installed in hard disk drive 1210. These computer program products are means for providing software to computer system 1200.

Computer programs (also called computer control logic) are stored in main memory 1206 and/or secondary memory 1208. Computer programs may also be received via communications interface 1220. Such computer programs, when executed, enable computer system 1200 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 1204 to implement the processes of the present invention, such as any of the methods described herein. Accordingly, such computer programs represent controllers of computer system 1200. Where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 1200 using removable storage drive 1212, interface 1214, or communications interface 1220.

In another embodiment, features of the invention are implemented primarily in hardware using, for example, hardware components such as application-specific integrated circuits (ASICs) and gate arrays. Implementation of a hardware state machine so as to perform the functions described herein will also be apparent to persons skilled in the relevant art(s).

6. Conclusion

The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

In addition, while various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the relevant art(s) that various changes in form and details can be made to the embodiments described herein without departing from the spirit and scope of the invention as defined in the appended claims. Accordingly, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:

1. A system for suppressing noise in a primary input speech signal that comprises a first desired speech component and a first background noise component using a reference input speech signal that comprises a second desired speech component and a second background noise component, the system comprising: a blocking matrix configured to filter the primary input speech signal, in accordance with a first transfer function, to estimate the second desired speech component and to remove the estimate of the second desired speech component from the reference input speech signal to provide a “cleaner” second background noise component; and an adaptive noise canceler configured to filter the “cleaner” second background noise component, in accordance with a second transfer function, to estimate the first background noise component and to remove the estimate of the first background noise component from the primary input speech signal to provide a noise suppressed primary input speech signal, wherein the first transfer function is determined based on statistics of the first desired speech component and the second desired speech component, and the second transfer function is determined based on statistics of the primary input speech signal and the “cleaner” second background noise component.
2. The system of claim 1, wherein the statistics of the first desired speech component and the second desired speech component comprise: desired speech statistics of the primary input speech signal determined based on an estimate of a power spectrum of the first desired speech component, and desired speech cross-channel statistics determined based on an estimate of a cross-spectrum between the first desired speech component and the second desired speech component.

3. The system of claim 2, wherein the blocking matrix comprises a statistics estimator configured to: estimate the power spectrum of the first desired speech component based on a product of a spectrum of the primary input speech signal and a complex conjugate of the spectrum of the primary input speech signal, and update the desired speech statistics of the primary input speech signal with the product of the spectrum of the primary input speech signal and the complex conjugate of the spectrum of the primary input speech signal at a rate related to a difference in energy or level between the primary input speech signal and the reference input speech signal.

4. The system of claim 2, wherein the blocking matrix comprises a statistics estimator configured to: estimate the cross-spectrum between the first desired speech component and the second desired speech component based on a spectrum of the reference input speech signal and the spectrum of the primary input speech signal, and update the desired speech cross-channel statistics based on the spectrum of the reference input speech signal and the spectrum of the primary input speech signal at a rate related to a difference in energy or level between the primary input speech signal and the reference input speech signal.
5. The system of claim 1, wherein the first transfer function is further determined based on statistics of the first background noise component and the second background noise component.

6. The system of claim 5, wherein the statistics of the first background noise component and the second background noise component comprise: stationary background noise statistics of the primary input speech signal determined based on a spectrum of the primary input speech signal, and stationary background noise cross-channel statistics determined based on a spectrum of the primary input speech signal and a spectrum of the reference input speech signal.

7. The system of claim 6, wherein the blocking matrix comprises a statistics estimator configured to: update the stationary background noise statistics of the primary input speech signal with the product of the spectrum of the primary input speech signal and the complex conjugate of the spectrum of the primary input speech signal at a rate related to a difference in energy or level between the primary input speech signal and the reference input speech signal.

8. The system of claim 6, wherein the blocking matrix comprises a statistics estimator configured to: update the stationary background noise cross-channel statistics based on the spectrum of the primary input speech signal and the spectrum of the reference input speech signal at a rate related to a difference in energy or level between the primary input speech signal and the reference input speech signal.
9. The system of claim 1, wherein the statistics of the primary input speech signal and the “cleaner” second background noise component comprise: background noise statistics determined based on a product of a spectrum of the “cleaner” second background noise component and a complex conjugate of the spectrum of the “cleaner” second background noise component, and cross-channel background noise statistics determined based on a spectrum of the primary input speech signal and the spectrum of the “cleaner” second background noise component.

10. The system of claim 9, wherein the adaptive noise canceler comprises a statistics estimator configured to: update the background noise statistics with the product of the spectrum of the “cleaner” second background noise component and the complex conjugate of the spectrum of the “cleaner” second background noise component at a rate related to a difference in energy or level between the primary input speech signal and the “cleaner” second background noise component.

11. The system of claim 9, wherein the adaptive noise canceler comprises a statistics estimator configured to: update the cross-channel background noise statistics based on the spectrum of the primary input speech signal and the spectrum of the “cleaner” second background noise component at a rate related to a difference in energy or level between the primary input speech signal and the “cleaner” second background noise component.

12. The system of claim 9, wherein the adaptive noise canceler comprises a statistics estimator configured to: update a fast version of the background noise statistics with the product of the spectrum of the “cleaner” second background noise component and the complex conjugate of the spectrum of the “cleaner” second background noise component at a first rate related to a difference in energy or level between the primary input speech signal and the “cleaner” second background noise component, update a slow version of the background noise statistics with the product of the spectrum of the “cleaner” second background noise component and the complex conjugate of the spectrum of the “cleaner” second background noise component at a second rate different from the first rate and related to a difference in energy or level between the primary input speech signal and the “cleaner” second background noise component, and select between the fast version of the background noise statistics and the slow version of the background noise statistics to determine the second transfer function based on which background noise statistics results in the noise suppressed primary input speech signal having a smaller energy.

13. The system of claim 9, wherein the adaptive noise canceler comprises a statistics estimator configured to: update a fast version of the cross-channel background noise statistics based on the spectrum of the primary input speech signal and the spectrum of the “cleaner” second background noise component at a first rate related to a difference in energy or level between the primary input speech signal and the “cleaner” second background noise component, update a slow version of the cross-channel background noise statistics based on the spectrum of the primary input speech signal and the spectrum of the “cleaner” second background noise component at a second rate different from the first rate and related to a difference in energy or level between the primary input speech signal and the “cleaner” second background noise component, and select between the fast version of the cross-channel background noise statistics and the slow version of the cross-channel background noise statistics to determine the second transfer function based on which cross-channel background noise statistics results in the noise suppressed primary input speech signal having a smaller energy.
14. The system of claim 1, wherein the blocking matrix receives and processes the primary input speech signal in the frequency domain.

15. The system of claim 1, wherein the blocking matrix receives and processes the primary input speech signal in the frequency domain using a plurality of time direction filters, each time direction filter configured to filter a different sub-band or frequency component of the primary input speech signal.

16. The system of claim 1, wherein the adaptive noise canceler receives and processes the “cleaner” second background noise component in the frequency domain.

17. The system of claim 1, wherein the adaptive noise canceler receives and processes the “cleaner” second background noise component in the frequency domain using a plurality of time direction filters, each time direction filter configured to filter a different sub-band or frequency component of the “cleaner” second background noise component.
18. The system of claim 1, further comprising: a microphone mismatch estimator configured to estimate a difference in microphone sensitivity between a first microphone that receives the primary input speech signal and a second microphone that receives the reference input speech signal.

19. The system of claim 18, wherein the microphone mismatch estimator is further configured to identify a presence of a diffuse sound field at least in part based on the primary input speech signal and the reference input speech signal, and to update the estimated difference in microphone sensitivity when the presence of a diffuse sound field is identified.

20-29. (canceled)

30. A method for suppressing noise in a primary signal that comprises a first desired speech signal and a first noise signal using a reference signal that comprises a second desired speech signal and a second noise signal, the method comprising: filtering the primary signal in accordance with a first transfer function to estimate the second desired speech signal; removing the estimate of the second desired speech signal from the reference signal to provide a “cleaner” second noise signal; filtering the “cleaner” second noise signal, in accordance with a second transfer function, to estimate the first noise signal; removing the estimate of the first noise signal from the primary signal to provide a noise suppressed primary signal; determining the first transfer function based on statistics of the first desired speech signal and the second desired speech signal; and determining the second transfer function based on statistics of the primary signal and the “cleaner” second noise signal.